Joe_Carlsmith

Karma: 3,441

Senior research analyst at Open Philanthropy. Doctorate in philosophy from the University of Oxford. Opinions my own.

Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)

Joe_Carlsmith · Nov 30, 2023, 4:43 PM
6 points
1 comment · EA link

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe_Carlsmith · Nov 29, 2023, 4:32 PM
7 points
0 comments · EA link

Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)

Joe_Carlsmith · Nov 28, 2023, 1:49 PM
8 points
0 comments · EA link

Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)

Joe_Carlsmith · Nov 27, 2023, 6:01 PM
11 points
1 comment · EA link

Situational awareness (Section 2.1 of “Scheming AIs”)

Joe_Carlsmith · Nov 26, 2023, 11:00 PM
12 points
1 comment · EA link

On “slack” in training (Section 1.5 of “Scheming AIs”)

Joe_Carlsmith · Nov 25, 2023, 5:51 PM
14 points
1 comment · EA link

Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)

Joe_Carlsmith · Nov 24, 2023, 7:18 PM
10 points
1 comment · EA link

A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)

Joe_Carlsmith · Nov 22, 2023, 3:24 PM
6 points
0 comments · EA link

Varieties of fake alignment (Section 1.1 of “Scheming AIs”)

Joe_Carlsmith · Nov 21, 2023, 3:00 PM
6 points
0 comments · EA link

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe_Carlsmith · Nov 15, 2023, 5:16 PM
71 points
4 comments · EA link

Superforecasting the premises in “Is power-seeking AI an existential risk?”

Joe_Carlsmith · Oct 18, 2023, 8:33 PM
114 points
3 comments · EA link

In memory of Louise Glück

Joe_Carlsmith · Oct 15, 2023, 3:10 AM
22 points
2 comments · 8 min read · EA link

The “no sandbagging on checkable tasks” hypothesis

Joe_Carlsmith · Jul 31, 2023, 11:13 PM
10 points
0 comments · 9 min read · EA link

Predictable updating about AI risk

Joe_Carlsmith · May 8, 2023, 10:05 PM
134 points
12 comments · 36 min read · EA link

[Linkpost] Shorter version of report on existential risk from power-seeking AI

Joe_Carlsmith · Mar 22, 2023, 6:06 PM
49 points
1 comment · 1 min read · EA link

A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my dissertation)

Joe_Carlsmith · Feb 21, 2023, 5:16 PM
64 points
0 comments · 1 min read · EA link

Seeing more whole

Joe_Carlsmith · Feb 17, 2023, 5:14 AM
122 points
9 comments · 26 min read · EA link

Why should ethical anti-realists do ethics?

Joe_Carlsmith · Feb 16, 2023, 4:27 PM
118 points
10 comments · 27 min read · EA link

[Linkpost] Human-narrated audio version of “Is Power-Seeking AI an Existential Risk?”

Joe_Carlsmith · Jan 31, 2023, 7:19 PM
9 points
0 comments · 1 min read · EA link

On sincerity

Joe_Carlsmith · Dec 23, 2022, 5:14 PM
46 points
3 comments · 42 min read · EA link