This other Ryan Greenblatt is my old account[1]. Here is my LW account.
[1] Account lost to the mists of time and expired university email addresses.
because it feels very differently about “99% of humanity is destroyed, but the remaining 1% are able to rebuild civilisation” and “100% of humanity is destroyed, civilisation ends”
Maybe? This depends on what you think about the probability that intelligent life re-evolves on earth (it seems likely to me) and how good you feel about the next intelligent species on earth vs humans.
the particular focus on extinction increases the threat from AI and engineered biorisks
IMO, most x-risk from AI probably doesn’t come from literal human extinction but instead from AI systems acquiring most of the control over long-run resources while some/most/all humans survive, but fair enough.
Where the main counterargument is that now the groups in power can be immortal and digital minds will be possible.
See also: AGI and Lock-in
What about “Is Power-Seeking AI an Existential Risk?”?
I don’t know if you’d count it as quantitative, but it is detailed.
My views are reasonably messy, complicated, hard to articulate, and based on a relatively diffuse set of intuitions. I think we also reason in a pretty different way about the situation than you seem to (3). I think it wouldn’t be impossible to try to write up a post on my views, but I would need to consolidate and think about how exactly to express where I’m at. (Maybe 2-5 person days of work.) I haven’t really consolidated my views or reached something close to reflective equilibrium.
I also just think that arguing about pure philosophy very rarely gets anywhere and is very hard to make convincing in general.
I’m somewhat uncertain on the “inside view/mechanistic” level. (But my all-considered view involves partially deferring to some people, which makes me overall less worried that I should immediately reconsider my life choices.)
I think my views are compelling, but I’m not sure if I’d say “very compelling”
I’m in agreement that this consideration makes it hard to do a direct comparison. But I think this consideration should mostly make us more uncertain, rather than making us think that humans are better than the alternative.
Actually, I was just trying to say “I can see what humans are like, and it seems pretty good relative to my current guesses about AIs in ways that don’t just update me up about AIs”; sorry about the confusion.
Currently, humans seem much closer to me on a values level than GPT-4 base does. I think this is also likely to be true of future AIs, though I understand why you might not find this convincing.
I think the architecture (learning algorithm, etc.) and training environment between me and other humans seems vastly more similar than between me and likely AIs.
I don’t think I’m going to flesh this argument out to an extent to which you’d find it sufficiently rigorous or convincing, sorry.
You should compare against human nature, which was optimized for something quite different from utilitarianism. If I add up the pros and cons of the thing humans were optimized for and compare it against the thing AIs will be optimized for, I don’t see why it comes out with humans on top, from a utilitarian perspective. Can you elaborate on your reasoning here?
I can’t quickly elaborate in a clear way, but some messy combination of:
I can currently observe humans, which screens off a bunch of the comparison and lets me do direct analysis.
I can directly observe current AIs and make predictions about future training methods, and their values seem to result from a much more heavily optimized and precise process with less “slack” in some sense. (Perhaps this is related to the genetic bottleneck; I’m unsure.)
AIs will primarily be trained on things which look extremely different from “cooperatively achieving high genetic fitness”.
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply to the internal high-level reasoning of AIs. This doesn’t seem true for humans.
Humans seem optimized for something which isn’t that far off from utilitarianism from some perspective? Make yourself survive, make your kin group survive, make your tribe survive, etc.? I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further.” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
(Again, note that I said in my comment above: “Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.” I edited this in to my prior comment, so you might have missed it, sorry.)
What are these a priori reasons and why don’t they similarly apply to AI?
I am a human. Other humans might end up in a similar spot on reflection.
(Also I care less about values of mine which are highly contingent wrt humans.)
The ones I would say are something like (approximately in priority order):
AIs’ values could result mostly from playing the training game or from other relatively specific optimizations they performed in training, which might result in extremely bizarre values from our perspective.
More generally, AI values might be highly alien in a way where caring about experience seems very strange to them.
AIs by default will be optimized for very specific commercial purposes with narrow specializations and a variety of hyperspecific heuristics, and the resulting values and generalizations of these will be problematic.
I care ultimately about what I would think is good upon (vast amounts of) reflection and there are good a priori reasons to think this is similar to what other humans (who care about using vast amounts of compute) will end up thinking is good.
As a sub argument, I might care specifically about things which are much more specific than “lots of good diverse experience”. And, divergences from what I care about (even conditioning on something roughly utilitarian) might result in massive discounts from my perspective.
I care less about my values and preferences in worlds where they seem relatively contingent, e.g. they aren’t broadly shared on reflection by reasonable fractions of humanity.
AIs don’t have a genetic bottleneck and thus can learn much more specific drives that perform well while evolution had to make values more discoverable and adaptable.
E.g. various things about empathy.
AIs might have extremely low levels of cognitive diversity in their training environments as far as co-workers go which might result in very different attitudes as far as caring about diverse experience.
Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.
Also, I should note that this isn’t a very strong list, though in aggregate it’s sufficient to make me think that human control is perhaps 4x better than AI control. (For reference, I’d say that me personally being in control is maybe 3x better than human control.) I disagree with a MIRI-style view about the disvalue of AI and the extent of fragility of value that seems implicit.
Another relevant consideration along these lines is that people who selfishly desire high wealth might mostly care about positional goods which are similar to current positional goods. Usage of these positional goods won’t burn much of any compute (resources for potential minds) even if these positional goods become insanely valuable in terms of compute. E.g., land values of interesting places on earth might be insanely high and people might trade vast amounts of computation for this land, but ultimately, the computation will be spent on something else.
why you care about the small fraction of resources spent on altruism
I’m also not sold it’s that small.
Regardless, it doesn’t seem like we’re making progress here.
If AI alignment causes high per capita incomes (because it enriches humans with a small population size), then plausibly this is worse than having a far larger population of unaligned AIs who have lower per capita consumption, from a utilitarian point of view.
Both seem negligible relative to the expected amount of compute spent on optimized goodness, in my view.
Also, I’m not sold that there will be more AIs, it depends on pretty complex details about AI preferences. I think it’s likely AIs won’t have preferences for their own experiences given current training methods and will instead have preferences for causing certain outcomes.
It’s possible we’re using these words differently, but I guess I’m not sure why you’re downplaying the value of economic consumption here
Ah, sorry, I was referring to the process of AI labor being used to produce the economic output not having much total moral value. I thought you were arguing that aligned AIs being used to produce goods would be where most value is coming from, because of the vast numbers of such AIs laboring relative to other entities. Sorry, by “from incidental economic consumption” I actually meant “incidentally (as a side effect of) economic consumption”. This is in response to things like:
Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations. To put it another way, the key factor influencing whether AIs are conscious in this scenario will be the relative efficiency of creating conscious AIs compared to unconscious ones for producing the goods and services demanded by future people. As these efficiency factors are likely to be similar in both aligned and unaligned scenarios, we are led to the conclusion that, from a total utilitarian standpoint, there is little moral difference between these two outcomes.
As far as the other thing you say, I still disagree, though for different (related) reasons:
As a consequence, I simply do not agree with the intuition that economic consumption is a rounding error compared to the much smaller fraction of resources spent on altruistic purposes.
I don’t agree with “much smaller”, and I think “rounding error” is reasonably likely as far as the selfish preferences of currently existing humans or the AIs that seize control go. (These entities might (presumably altruistically) create entities which then selfishly satisfy their preferences, but that seems pretty different.)
My main counterargument is that selfish preferences will result in wildly fewer entities if such entities aren’t into (presumably altruistically) making more entities, and thus will be extremely inefficient. Of course, it’s possible that you have AIs with non-indexical preferences but which are de facto selfish in other ways.
E.g., for humans you have 10^10 beings which are probably radically inefficient at producing moral value. For AIs it’s less clear and depends heavily on how you operationalize selfishness.
I have a general view like “in the future, the main way you’ll get specific things that you might care about is via people trying specifically to make those things because optimization is extremely powerful”.
I’m probably not going to keep responding as I don’t think I’m comparatively advantaged at fleshing this out. And doing this in a comment section seems suboptimal. If this is anyone’s crux for working on AI safety, though, consider contacting me and I’ll consider setting you up with someone who I think understands my views and would be willing to go through the relevant arguments with you. The same offer applies to you Matthew, particularly if this is a crux, but I think we should use a medium other than EA Forum comments.
Do you have an argument for why humans are more likely to try to create morally valuable lives compared to unaligned AIs?
TBC, the main point I was trying to make was that you didn’t seem to be presenting arguments about what seems to me like the key questions. Your summary of your position in this comment seems much closer to arguments about the key questions than I interpreted your post being. I interpreted your post as claiming that most value would result from incidental economic consumption under either humans or unaligned AIs, but I think you maybe don’t stand behind this.
Separately, I think the “maybe AIs/humans will be selfish and/or not morally thoughtful” argument mostly just hits both unaligned AIs and humans equally hard such that it just gets normalized out. And then the question is more about how much you care about the altruistic and morally thoughtful subset.
(E.g., the argument you make in this comment seemed to me like about 1⁄6 of your argument in the post and it’s still only part of the way toward answering the key questions from my perspective. I think I partially misunderstood the emphasis of your argument in the post.)
I do have arguments for why I think human control is more valuable than control by AIs that seized control from humans, but I’m not going to explain them in detail in this comment. My core summary would be something like: “I expect substantial convergence toward my utilitarian-ish views among morally thoughtful humans who reflect, and I expect notably less convergence between me and AIs. I expect that AIs have somewhat messed up, complex, and specific values in ways which might make them not care about things we care about as a result of current training processes, while I don’t have such an argument for humans.”
As far as what I do think the key questions are, I think they are something like:
What preferences do humans/AIs have for radically longer lives, massive self-enhancement, and potentially long periods of reflection?
How much do values/views diverge/converge between different altruistically minded humans who’ve thought about it for extremely long durations?
Even if various entities are into creating “good experiences”, how much do these views diverge in what is best? My guess would be that even if two entities are each maximizing good experiences from their own perspective, the relative goodness per unit of compute can be much lower from the other entity’s perspective (e.g., easily 100x lower, maybe more).
How similar are my views on what is good after reflection to other humans vs AIs?
How much should we care about worlds where morally thoughtful humans reach radically different conclusions on reflection?
Structurally, what sorts of preferences do AI training processes impart on AIs, conditional on these AIs successfully seizing power? I also think this is likely despite humanity likely resisting to at least some extent.
It seems like your argument is something like “who knows about AI preferences, also, they’ll probably have similar concepts as we do” and “probably humanity will just have the same observed preferences as they currently do”.
But I think we can get much more specific guesses about AI preferences such that this weak indifference principle seems unimportant, and I think human preferences will change radically, e.g. preferences will change far more in the next 10 million years than in the last 2000 years.
Note that I’m not making an argument that human control has greater value in this comment, just trying to explain why I don’t think your argument is very relevant. I might try to write up something about my overall views here, but it doesn’t seem like my comparative advantage and it currently seems non-urgent from my perspective. (Though embarrassing for the field as a whole.)
If I had to pick a second consideration I’d go with:
After millions of years of life (or much more) and massive amounts of cognitive enhancement, the way post-humans might act isn’t clearly well predicted by just looking at their current behavior.
Again, I’d like to stress that my claim is:
Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won’t be coming from incidental consumption.
Maybe the most important single consideration is something like:
Value can be extremely dense in computation optimized for value, relative to the density of value from AIs used for economic activity (instead of for value).
So, we should focus on the question of entities trying to create morally valuable lives (or experience or whatever relevant similar property we care about) and then answer this.
(You do seem to talk about “will AIs have more/less utilitarian impulses than humans”, but you seem to talk about this almost entirely from the perspective of growing the economy rather than question like how good the lives will be.)
Hmm, this is more of a claim than a consideration, but I’d highlight:
It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced via people trying to improve moral value rather than incidentally via economic production. This applies for both aligned and unaligned AI. I expect that only a tiny fraction of available computation goes toward optimizing economic production, that only a smaller fraction of this is morally relevant, and that the weight on this moral relevance is much lower than that of computation specifically optimized for moral relevance when operating from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it’s possible that this disagreement is driven by some of the other considerations I list.
The main thing this claim disputes is:
Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations.
(and some related points).
Sorry, I don’t think this exactly addresses your comment. I’ll maybe try to do a better job in a bit. I think a bunch of the considerations I mention are relatively diffuse, but important in aggregate.
One additional meta-level point which I think is important: I think that existing writeups of why human control would have more moral value than unaligned AI control from a longtermist perspective are relatively weak and often specific writeups are highly flawed. (For some discussion of flaws, see this sequence.)
I just think that this write-up misses what seem to me to be key considerations, I’m not claiming that existing work settles the question or is even robust at all.
And it’s somewhat surprising and embarrassing that this is the state of the current work, given that longtermism is reasonably common and arguments for working on AI x-risk from a longtermist perspective are also common.
The “footprints on the future” thing could be referencing this post.
(Edit: to be clear, this link is not an endorsement.)
I’m not sure that I buy that critics lack motivation. At least in the space of AI, there will be (and already are) people with immense financial incentive to ensure that x-risk concerns don’t become very politically powerful.
Of course, it might be that the best move for these critics won’t be to write careful and well reasoned arguments for whatever reason (e.g. this would draw more attention to x-risk so ignoring it is better from their perspective).
Edit: this is mentioned in the post, but I’m a bit surprised because this isn’t emphasized more.