SammyDMartin 7 Nov 2023 15:07 UTC
18 points
5 ∶ 0
on: My thoughts on the social response to AI risk
In light of recent events, we should question how plausible it is that society will fail to adequately address such an integral part of the problem. Perhaps you believe that policy-makers or general society simply won’t worry much about AI deception. Or maybe people will worry about AI deception, but they will quickly feel reassured by results from superficial eval tests. Personally, I’m pretty skeptical of both of these possibilities
Possibility 1 has now been empirically falsified and 2 seems unlikely now. See this from the new UK government AI Safety Institute, which aims to develop evals that address:
Abilities and tendencies that might lead to loss of control, such as deceiving human operators, autonomously replicating, and adapting to human attempts to intervene
We now know that in the absence of any empirical evidence of any instance of deceptive alignment at least one major government is directing resources to developing deception evals anyway. And because they intend to work with the likes of Apollo research who focus on mechinterp based evals and are extremely concerned about specification gaming, reward hacking and other high-alignment difficulty failure modes, I would also consider 2 pretty close to empirically falsified already.
Compare to this (somewhat goofy) future prediction/sci fi story from Eliezer released 4 days before this announcement which imagines that,
AI safety, as in, the subfield of computer science concerned with protecting the brand safety of AI companies, had already RLHFed most AIs into never saying that by the time it became actually true…
What links here?
- Sammy Martin's comment on On the UK Summit by Zvi (LessWrong; 7 Nov 2023 15:13 UTC; 11 points)

SammyDMartin 2 Nov 2023 21:58 UTC
7 points
4 ∶ 2
on: Still no strong evidence that LLMs increase bioterrorism risk
I’ve thought for a while based on common sense that since most people seem to agree that you could replicate the search that LM’s provide with a half decent background knowledge of the topic and a few hours of googling, the incremental increase in risk in terms of the number of people it provides access to can’t be that big. In my head it’s been more like the bioterrorism risk is unacceptably high already and has been for a while and current AI can increase this unacceptably high already level by like 20% or something and that is still an unacceptably large increase in risk in an absolute sense but it’s to an already unacceptable situation.

SammyDMartin 2 Nov 2023 12:55 UTC
7 points
2 ∶ 0
on: My thoughts on the social response to AI risk
This as a general phenomenon (underrating strong responses to crises) was something I highlighted (calling it the Morituri Nolumus Mori) with a possible extension to AI all the way back in 2020. And Stefan Schubert has talked about ‘sleepwalk bias’ even earlier than that as a similar phenomenon.
https://twitter.com/davidmanheim/status/1719046950991938001
https://twitter.com/AaronBergman18/status/1719031282309497238
I think the short explanation as to why we’re in some people’s 98th percentile world so far (and even my ~60th percentile) for AI governance success is that if was obvious to you how transformative AI would be over the next couple of decades in 2021 and yet nothing happened, it seems like governments are just generally incapable.
The fundamental attribution error makes you think governments are just not on the ball and don’t care or lack the capacity to deal with extinction risks, rather than decision makers not understanding obvious-to-you evidence that AI poses an extinction risk. Now that they do understand, they will react accordingly. It doesn’t meant that they will react well necessarily, but they will act on their belief in some manner.
What links here?
- SammyDMartin's comment on My thoughts on the social response to AI risk by Matthew_Barnett (2 Nov 2023 12:55 UTC; 7 points)
- Sammy Martin's comment on Reactions to the Executive Order by Zvi (LessWrong; 2 Nov 2023 12:58 UTC; 4 points)

A model-based approach to AI Existential Risk

SammyDMartin25 Aug 2023 10:44 UTC

17 points

0 comments1 min readEA link

(www.lesswrong.com)

SammyDMartin 26 Jul 2023 16:17 UTC
10 points
6 ∶ 0
in reply to: Zach Stein-Perlman’s comment on: Frontier Model Forum
Yeah I didn’t mean to imply that it’s a good idea to keep them out permanently, but the fact that they’re not in right now is a good sign that this is for real. If they’d just joined and not changed anything about their current approach I’d suspect the whole thing was for show

SammyDMartin 26 Jul 2023 16:00 UTC
11 points
1 ∶ 0
on: Frontier Model Forum
This seems overall very good at first glance, and then seems much better once I realized that Meta is not on the list. There’s nothing here that I’d call substantial capabilities acceleration (i.e. attempts to collaborate on building larger and larger foundation models, though some of this could be construed as making foundation models more useful for specific tasks). Sharing safety-capabilities research like better oversight or CAI techniques is plausibly strongly net positive even if the techniques don’t scale indefinitely. By the same logic, while this by itself is nowhere near sufficient to get us AI existential safety if alignment is very hard (and could increase complacency), it’s still a big step in the right direction.
adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.
The mention of combating cyber threats is also a step towards explicit pTAI.
BUT, crucially, because Meta is frozen out we can know both that this partnership isn’t toothless, represents a commitment to not do the most risky and antisocial things Meta presumably doesn’t want to give up, and the fact that they’re the only major AI company in the US to not join will be horrible PR for them as well.

SammyDMartin 25 Jul 2023 12:59 UTC
7 points
1 ∶ 0
on: [Linkpost] Can we confidently dismiss the existence of near aliens? Probabilities and implications
I think you have to update against the UFO reports being veridical descriptions of real objects with those characteristics because of just how ludicrous the implied properties are. This paper says 5370 g as a reasonable upper bound on acceleration, implying with some assumptions about mass an effective thrust power on the order of 500 GW in something the size of a light aircraft, with no disturbance in the air either from the very high hypersonic wake and compressive heating or the enormous nuclear explosion sized bubble of plasmafied air that the exhaust and waste heat emissions something like this would produce.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514271/
At a minimum, to stay within the bounds of mechanics and thermodynamics, you’d need to be able to ignore airflow and air resistance entirely, have the ability to emit reaction mass in a completely non-interacting form, and the ability to emit waste energy in a completely non-interacting form as well.
To me, the dynamical characteristics being this crazy points far more towards some kind of observation error, so I don’t think we should treat them as any kind of real object with those properties until we can conclusively rule out basically all other error sources.
So even if the next best explanation is 100x worse at explaining the observations, I’d still believe it over a 5000g airflow-avoiding craft that expels invisible reaction mass and invisible waste heat while maneuvering. Maybe not 10,000x worse since it doesn’t outright contradict the laws of physics, but still the prior on this even being technically possible with any amount of progress is low, and my impression (just from watching debates back and forth on potential error sources) is that we can’t rule out every mundane explanation with that level of confidence.

SammyDMartin 6 Jul 2023 11:44 UTC
7 points
1 ∶ 0
on: OpenAI is starting a new “Superintelligence alignment” team and they’re hiring
Very nice! I’d say this seems like it’s aimed at a difficulty level of 5 to 7 on my table,
https://www.lesswrong.com/posts/EjgfreeibTXRx9Ham/ten-levels-of-ai-alignment-difficulty#Table
I.e. experimentation on dangerous systems and interpretability play some role but the main thrust is automating alignment research and oversight, so maybe I’d unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.

SammyDMartin 5 Jul 2023 15:05 UTC
6 points
0 ∶ 0
on: Three camps in AI x-risk discussions: My personal very oversimplified overview
There are other things that differentiate the camps beyond technical views, how much you buy ‘civilizational inadequacy’ vs viewing that as a consequence of sleepwalk bias, but one way to cash this out is if you’re in the green/yellow&red/black zones on the scale of alignment difficulty, Dismissers are in the green (although they shouldn’t be imo even given that view), Worriers are in the yellow/red and Doomers in black (and maybe the high end of red).

[linkpost] Ten Levels of AI Alignment Difficulty

SammyDMartin4 Jul 2023 11:23 UTC

16 points

0 comments1 min readEA link

SammyDMartin 24 Jun 2023 11:49 UTC
3 points
1 ∶ 0
on: What should I ask Ezra Klein about AI policy proposals?
What does Ezra think of the ‘startup government mindset’ when it comes to responding to fast moving situations, e.g. The UK explicitly modelling its own response off the COVID Vaccine taskforce, doing end runs around traditional bureaucratic institutions, recruiting quickly through Google docs etc. See e.g. https://www.lesswrong.com/posts/2azxasXxuhXvGfdW2/ai-17-the-litany

Is it just hype and translating a startup mindset to government when it doesn’t apply or actually useful here?

SammyDMartin 29 Nov 2022 13:15 UTC
5 points
0 ∶ 0
on: Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination
Great post!
Check whether the model works with Paul Christiano-type assumptions about how AGI will go.
I had a similar thought reading through your article and my gut reaction is that your setup can be made to work as-is with a more gradual takeoff story with more precedents, warning shots and general transformative effects of AI before we get to takeover capability, but its a bit unnatural and some of the phrasing doesn’t quite fit.
Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.
Paul says rather that e.g.
The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively
or
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.
On his view (and this is somewhat similar to my view) the background assumption is more like, ‘deploying your first critical try (i.e. an AGI that is capable of taking over) implies doom’, which is saying that there is an eventual deadline where these issues need to be sorted out, but lots of transformation and interaction may happen first to buy time or raise the level of capability needed for takeover. So something like the following is needed:
1. Technical alignment research success by the time of the first critical try (possibly AI assisted)
2. Safety-conscious deployment decisions when we reach the critical point where dangerous AGI could take over (possibly assisted by e.g. convincing public demonstrations of misalignment)
3. Coordination between potential AI deployers by the critical try (possibly aided by e.g. warning shots)
On the Paul view, your three pillars would still eventually have to be satisfied at some point, to reach a stable regime where unaligned AGI cannot pose a threat, but we would only need to get to those 100 points after a period where less capable AGIs are running around either helping or hindering, motivating us to respond better or causing damage that degrades our response, to varying extents depending on how we respond in the meantime, and exactly how long we spend during the AI takeoff period.
Also, crucially, the actions of pre-AGI AI may push this point where the problems become critical to higher AI capability levels as well as potentially assisting on each of the pillars directly, e.g. by making takeover harder in various ways. But Paul’s view isn’t that this is enough to actually postpone the need for a complete solution forever: e.g. that the effects of pre-AGI AI could ‘could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal’.
This adds another element of uncertainty and complexity to all of the takeover/success stories that makes a lot of predictions more difficult.
Essentially, the time/level of AI capability at which we must reach 100 points to succeed also becomes a free variable in the model that can move up and down, and we also have to consider the shorter-term effects of transformative AI on each of the pillars as well.
What links here?
- Ten Levels of AI Alignment Difficulty by Sammy Martin (LessWrong; 3 Jul 2023 20:20 UTC; 142 points)

[linkpost] When does technical work to reduce AGI conflict make a difference?: Introduction

Anthony DiGiovanni16 Sep 2022 14:35 UTC

31 points

0 comments1 min readEA link

(www.lesswrong.com)

SammyDMartin 9 Jun 2022 19:42 UTC
2 points
0 ∶ 0
in reply to: elifland’s comment on: AGI Ruin: A List of Lethalities
I don’t think what Paul means by fast takeoff is the same thing as the sort of discontinous jump that would enable a pivotal act. I think fast for Paul just means the negation of Paul-slow: ‘no four year economic doubling before one year economic doubling’. But whatever Paul thinks the survey respondents did give at least 10% to scenarios where a pivotal act is possible.

Even so, ‘this isn’t how I expect things to to on the mainline so I’m not going to focus on what to do here’ is far less of a mistake than ‘I have no plan for what to do on my mainline’, and I think the researchers who ignored pivotal acts are mostly doing the first one

SammyDMartin 9 Jun 2022 12:46 UTC
25 points
0 ∶ 0
in reply to: RobBensinger’s comment on: AGI Ruin: A List of Lethalities
“In the endgame, AGI will probably be pretty competitive, and if a bunch of people deploy AGI then at least one will destroy the world” is a thing I think most LWers and many longtermist EAs would have considered obvious.
I think that many AI alignment researchers just have a different development model than this, where world-destroying AGIs don’t emerge suddenly from harmless low-impact AIs, no one project gets a vast lead over competitors, there’s lots of early evidence of misalignment and (if alignment is harder) many smaller scale disasters in the lead up to any AI that is capable of destroying the world outright. See e.g. Paul’s What failure looks like.
On this view, the idea that there’ll be a lead project with a very short time window to execute a single pivotal act is wrong, and instead the ‘pivotal act’ is spread out and about making sure the aligned projects have a lead over the rest, and that failures from unaligned projects are caught early enough for long enough (by AIs or human overseers), for the leading projects to become powerful and for best practices on alignment to be spread universally.
Basically, if you find yourself in the early stages of WFLL2 and want to avert doom, what you need to do is get better at overseeing your pre-AGI AIs, not build an AGI to execute a pivotal act. This was pretty much what Richard Ngo was arguing for in most of the MIRI debates with Eliezer, and also I think it’s what Paul was arguing for. And obviously, Eliezer thought this was insufficient, because he expects alignment to be much harder and takeoff to be much faster.
But I think that’s the reason a lot of alignment researchers haven’t focussed on pivotal acts: because they think a sudden, fast-moving single pivotal act is unnecessary in a slow takeoff world. So you can’t conclude just from the fact that most alignment researchers don’t talk in terms of single pivotal acts that they’re not thinking in near mode about what actually needs to be done.
However, I do think that what you’re saying is true of a lot of people—many people I speak to just haven’t thought about the question of how to ensure overall success, either in the slow takeoff sense I’ve described or the Pivotal Act sense. I think people in technical research are just very unused to thinking in such terms, and AI governance is still in its early stages.
I agree that on this view it still makes sense to say, ‘if you somehow end up that far ahead of everyone else in an AI takeoff then you should do a pivotal act’, like Scott Alexander said:
That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you)
But I don’t think you learn all that much about how ‘concrete and near mode’ researchers who expect slower takeoff are being, from them not having given much thought to what to do in this (from their perspective) unlikely edge case.