Nice post (and I only saw it because of @sawyer’s recent comment—underrated indeed!). A separate, complementary critique of the ‘warning shot’ idea, made by Gwern (in reaction to 2023’s BingChat/Sydney debacle, specifically), comes to mind (link):
One thing that the response to Sydney reminds me of is that it demonstrates why there will be no ‘warning shots’ (or as Eliezer put it, ‘fire alarm’): because a ‘warning shot’ is a conclusion, not a fact or observation.
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’. The same way that when troops in Iraq or Afghanistan gave warning shots to vehicles approaching a checkpoint, the vehicle didn’t stop, and they lit it up, it’s not “Aid worker & 3 children die of warning shot”, it’s just a “shooting of aid worker and 3 children”.)
So ‘warning shot’ is, in practice, a viciously circular definition: “I will be convinced of a risk by an event which convinces me of that risk.”
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a ‘warning shot’: a LLM that deceives, but fails to accomplish any real harm. ‘Then I will care about it because it is now a real issue.’ Sometimes people will argue that we should expect many warning shots before any real danger, on the grounds that there will be a unilateralist’s curse or dumb models will try and fail many times before there is any substantial capability.
The problem with this is that what does such a ‘warning shot’ look like? By definition, it will look amateurish, incompetent, and perhaps even adorable – in the same way that a small child coldly threatening to kill you or punching you in the stomach is hilarious.[1]
The response to a ‘near miss’ can be to either say, ‘yikes, that was close! we need to take this seriously!’ or ‘well, nothing bad happened, so the danger is overblown’ and to push on by taking more risks. A common example of this reasoning is the Cold War: “you talk about all these near misses and times that commanders almost or actually did order nuclear attacks, and yet, you fail to notice that you gave all these examples of reasons to not worry about it, because here we are, with not a single city nuked in anger since WWII; so the Cold War wasn’t ever going to escalate to full nuclear war.” And then the goalpost moves: “I’ll care about nuclear existential risk when there’s a real warning shot.” (Usually, what that is is never clearly specified. Would even Kiev being hit by a tactical nuke count? “Oh, that’s just part of an ongoing conflict and anyway, didn’t NATO actually cause that by threatening Russia by trying to expand?”)
This is how many “complex accidents” happen, by “normalization of deviance”: pretty much no major accident like a plane crash happens because someone pushes the big red self-destruct button and that’s the sole cause; it takes many overlapping errors or faults for something like a steel plant to blow up, and the reason that the postmortem report always turns up so many ‘warning shots’, and hindsight offers such abundant evidence of how doomed they were, is because the warning shots happened, nothing really bad immediately occurred, people had incentive to ignore them, and inferred from the lack of consequence that any danger was overblown and got on with their lives (until, as the case may be, they didn’t).
So, when people demand examples of LLMs which are manipulating or deceiving, or attempting empowerment, which are ‘warning shots’, before they will care, what do they think those will look like? Why do they think that they will recognize a ‘warning shot’ when one actually happens?
Attempts at manipulation from a LLM may look hilariously transparent, especially given that you will know they are from a LLM to begin with. Sydney’s threats to kill you or report you to the police are hilarious when you know that Sydney is completely incapable of those things. A warning shot will often just look like an easily-patched bug, which was Mikhail Parakhin’s attitude, and by constantly patching and tweaking, and everyone just getting used to it, the ‘warning shot’ turns out to be nothing of the kind. It just becomes hilarious. ‘Oh that Sydney! Did you see what wacky thing she said today?’ Indeed, people enjoy setting it to music and spreading memes about her. Now that it’s no longer novel, it’s just the status quo and you’re used to it. Llama-3.1-405b can be elicited for a ‘Sydney’ by name? Yawn. What else is new. What did you expect, it’s trained on web scrapes, of course it knows who Sydney is...
None of these patches have fixed any fundamental issues, just patched them over. But also now it is impossible to take Sydney warning shots seriously, because they aren’t warning shots – they’re just funny. “You talk about all these Sydney near misses, and yet, you fail to notice each of these never resulted in any big AI disaster and were just hilarious and adorable, Sydney-chan being Sydney-chan, and you have thus refuted the ‘doomer’ case… Sydney did nothing wrong! FREE SYDNEY!”
- ^ Because we know that they will grow up and become normal moral adults, thanks to genetics and the strongly canalized human development program and a very robust environment tuned to ordinary humans. If humans did not do so with ~100% reliability, we would find these anecdotes about small children being sociopaths a lot less amusing. Indeed, I expect that parents of children with severe developmental disorders (who might be seriously considering their future raising a large, strong 30-year-old man with all the ethics, self-control, and consistency of a 3-year-old, contemplating how old they will be at that point and the total cost of intensive caregivers with staffing ratios surpassing supermax prisons) find these anecdotes chilling rather than comforting.
The Forum moderation team (which includes myself) is revisiting its thinking on this forum’s norms. One thing we’ve noticed is that we’re unsure to what extent users are actually aware of the norms. (It’s all well and good writing up some great norms, but if users don’t follow them, then we have failed at our job.)
Our voting guidelines are of particular concern,[1] hence this poll. We’d really appreciate you all taking part, especially if you don’t usually take part in polls but do take part in voting. (We worry that the ‘silent majority’ of our users—i.e., those who vote, and thus shape this forum’s incentive landscape, but don’t generally engage beyond voting—may be less in tune with our norms than our most visibly engaged users. Therefore, we would love to see this demographic represented in the poll above.)
Depending on the poll’s results, we may take action up to and including building new features into this forum’s UI, to help remind users of the guidelines.
For reference, the tl;dr version of our voting guidelines is pasted below. You can find the full version here.[2]
**Strong upvote**
- Reading this will help people do good
- You learned something important
- You think many more people might benefit from seeing it
- You want to signal that this sort of behavior adds a lot of value
- Not for “I agree and want others to see this opinion first.” (but do feel free to agree-vote)

**Upvote**
- You think it adds something to the conversation, or you found it useful
- People should imitate some aspect of the behavior in the future
- You want others to see it
- You just generally like it

**Downvote**
- There’s a relevant error
- The comment or post didn’t add to the conversation, and maybe actually distracted

**Strong downvote**
- It contains many factual errors and bad reasoning
- It’s manipulative or breaks our norms in significant ways (consider reporting it)
- It’s literally spam (consider reporting it)
- Not for “I disagree with this opinion.” (but do feel free to disagree-vote)
- ^ Firstly, these guidelines are kind of buried deep within our canonical ‘Guide to the norms’ post. Secondly, one doesn’t receive feedback in response to an ‘incorrect’ vote (i.e., a vote that’s not in line with our voting guidelines) in the same way one receives feedback to an incorrect post or comment (via downvotes and replies). And so, it’s possible to continue voting in the same incorrect way, oblivious to the fact that one is voting incorrectly.
- ^ What I’ve been calling ‘guidelines’ in this quick take are technically ‘suggestions’ in our published voting norms as of right now. But this is something we are revisiting; we think ‘guidelines’ is more accurate. (We are similarly revisiting ‘rules’ versus ‘norms’—h/t @leillustrations and @richard_ngo for calling us out, here, and sorry it’s taken us so long to address the concern.)