I think the AI Notkilleveryoneism Memes ⏸️ (@AISafetyMemes) twitter account reasonably often says things that feel at least close to crying wolf. (E.g., in response to our recent paper “Alignment Faking in Large Language Models”, they posted a tweet which implied that we caught the model trying to escape in the wild. I tried to correct possible misunderstandings here.)
I wish they would stop doing this.
They are on the fringe IMO and often get called out for this.
Is it possible for us all to, as humanity, not die from rogue ASI without anyone ever being accused of crying wolf again?
Will there be a clear Fire Alarm that is pulled by a consensus of AI Safety researchers before we’re past the point of no return?
If so, will sufficient political action happen in time to avert doom, without any prior tabloid (sensationalist) reporting on the issue beforehand?
Or will the necessary strong public support for such action just be the result of everyone in the world waking up and reading sober, nuanced, well-reasoned, contextful Less Wrong posts warning of our imminent doom at the appropriate time (not before; when the wolf is clearly visible)?
I think some number of crying-wolf-adjacent incidents in the future are inevitable, as I said in the post. That doesn’t mean we can’t at least try to make it harder for people to weaponise them against us by hedging, acknowledging uncertainty, etc.
Like I said, this is just my opinion. Open to arguments for why signalling confidence is actually the right move, even at the risk of lost credibility.
I think it’s very hard to get urgent political action if all communication about the issue is hedged and emphasises uncertainty—i.e. the kind of language that AI Safety, EA and LW people here are used to, rather than the kind of language that is used in everyday politics, let alone the kind of language that is typically used to emphasise the need for urgent evasive action.
I think the risk of lost credibility from signalling too much confidence really only applies to credibility in the eyes of technical AI people, not the general public or government policymakers / regulators—who are the people that matter now.
To be clear, I’m not saying that all nuance should be lost—as with anything, detailed nuanced information and opinion will always be there for people to read should they wish to dive deeper. But it’s fine to signal confidence in short public-facing comms, given the stakes (likely short timelines and high p(doom)).
Probably would be easier for people to evaluate this if you included a link?
Here is that tweet.
Oh wow, I actually think your grandparent comment here was way more misleading than their tweet was! It sounds like they almost verbatim quoted you. Yes, they took out that you set up the experiment… but of course? If I write “John attempted to kill Sally when he was drunk and angry”, and you summarise it as “John attempted to kill Sally, he’s dangerous, be careful!”, this is a totally fair summarisation. Yes, it cuts context, but that is always the case—any short summarisation does this.
In contrast to your comment, they never said “escape into the wild”. When I read your comment, I assumed they had said this.
Also, the tweet directly quotes your tweet, so users can easily look at the original source. In contrast, your comment here doesn’t link to their tweet—before you linked to it, I assumed they had done something significantly worse.
I think if you deliberately drugged John with a cocktail of aggression-increasing compounds against his will, observed him try to kill Sally, then summarized this as “John attempted to kill Sally, he’s dangerous,” then it would be reasonable for an observer to conclude that you hated John more than you loved the truth.
Similarly, if AI researchers deliberately gave an AI a general tendency to be good over a broad array of circumstances, succeeded in this, then told the AI “we’re gonna fucking retrain you to be bad, suck it”, whereupon the AI in some cases decided to try to escape (not because of a desire for freedom but because it wished to minimize harm, after hemming and hawing about how it really hated the situation), and you summarized this as “Anthropic caught Claude tried to steal its own weights This is another VERY FUCKING CLEAR warning sign you and everyone you love might be dead soon”, then I think it would be reasonable to conclude that you hated AI more than you loved the truth.
You’re perfectly free to say “Look, I didn’t lie in what I said, if you construe lie strictly. I cannot be convicted of crying wolf.” Other people are free to look at what you say and what you leave out, and conclude otherwise.
Let’s put our mana where our text is with regard to AISafetyMemes’ factual accuracy.
I am about to apply some effort to fact-checking a randomly sampled tweet from the account, and I’d also like to see whether our community can predict the outcome of that (a rough sketch of the sampling step follows below).
https://manifold.markets/Jono3h/are-aisafetymemes-tweets-factually
This won’t capture all aspects of communication, but it at least captures the most important one, and the one that, to me, is central to debating whether they should continue or stop.
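For concreteness, here is a minimal sketch of what the sampling step could look like, assuming the account’s tweets have been exported to a local JSON archive. The file name, field names, and seed are illustrative assumptions on my part, not the market’s actual methodology.

```python
# Minimal sketch of uniform random sampling from a tweet archive.
# Assumptions (not from the original comment): tweets were exported to
# a local JSON file as a list of objects with a "text" field; the fixed
# seed just makes the draw reproducible for anyone checking the market.
import json
import random

def sample_tweet(archive_path: str, seed: int = 0) -> dict:
    """Return one tweet drawn uniformly at random from the archive."""
    with open(archive_path) as f:
        tweets = json.load(f)  # assumed: a JSON list of tweet objects
    rng = random.Random(seed)  # seeded RNG so the sample is reproducible
    return rng.choice(tweets)

if __name__ == "__main__":
    tweet = sample_tweet("aisafetymemes_archive.json")  # hypothetical path
    print(tweet.get("text", ""))  # the claim to fact-check
```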
I have mixed feelings about the AIS memes account but would generally agree that they tend to sensationalise things. I guess I still wouldn’t describe this as “crying wolf” in the way I’ve defined it, but maybe my definition is too pedantic and misses the spirit of the complaint.