Scheming AIs: Will AIs fake alignment during training in order to get power?

21 Nov 2023 16:30 UTC

This is an EA Forum sequence version of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”, available on arXiv at https://arxiv.org/pdf/2311.08379.pdf. The report is long, and I’m hoping that posting shorter sections separately will make them easier to digest, reference, and comment on.

The first post in the sequence contains a summary of the full report. The summary covers most of the main points and technical terms, and I’m hoping it will provide much of the context necessary to understand individual sections of the report on their own.

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe_Carlsmith · 15 Nov 2023 17:16 UTC

71 points

4 comments · 30 min read

Varieties of fake alignment (Section 1.1 of “Scheming AIs”)

Joe_Carlsmith · 21 Nov 2023 15:00 UTC

6 points

0 comments · 10 min read

A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)

Joe_Carlsmith · 22 Nov 2023 15:24 UTC

6 points

0 comments · 6 min read

Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)

Joe_Carlsmith · 24 Nov 2023 19:18 UTC

10 points

1 comment · 20 min read

On “slack” in training (Section 1.5 of “Scheming AIs”)

Joe_Carlsmith · 25 Nov 2023 17:51 UTC

14 points

1 comment · 5 min read

Situational awareness (Section 2.1 of “Scheming AIs”)

Joe_Carlsmith · 26 Nov 2023 23:00 UTC

12 points

1 comment · 6 min read

Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)

Joe_Carlsmith · 27 Nov 2023 18:01 UTC

11 points

1 comment · 8 min read

Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)

Joe_Carlsmith · 28 Nov 2023 13:49 UTC

8 points

0 comments · 13 min read

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe_Carlsmith · 29 Nov 2023 16:32 UTC

7 points

0 comments · 10 min read

Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)

Joe_Carlsmith · 30 Nov 2023 16:43 UTC

6 points

1 comment · 5 min read

How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of “Scheming AIs”)

Joe_Carlsmith · 1 Dec 2023 14:51 UTC

6 points

0 comments · 6 min read

The goal-guarding hypothesis (Section 2.3.1.1 of “Scheming AIs”)

Joe_Carlsmith · 2 Dec 2023 15:20 UTC

6 points

1 comment · 12 min read

Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of “Scheming AIs”)

Joe_Carlsmith · 3 Dec 2023 18:32 UTC

6 points

1 comment · 15 min read

Non-classic stories about scheming (Section 2.3.2 of “Scheming AIs”)

Joe_Carlsmith · 4 Dec 2023 18:44 UTC

12 points

1 comment · 16 min read

Arguments for/against scheming that focus on the path SGD takes (Section 3 of “Scheming AIs”)

Joe_Carlsmith · 5 Dec 2023 18:48 UTC

7 points

1 comment · 20 min read

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Joe_Carlsmith · 6 Dec 2023 19:28 UTC

9 points

1 comment · 7 min read

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Joe_Carlsmith · 7 Dec 2023 15:05 UTC

6 points

1 comment · 14 min read

Speed arguments against scheming (Sections 4.4-4.7 of “Scheming AIs”)

Joe_Carlsmith · 8 Dec 2023 21:10 UTC

6 points

0 comments · 11 min read

Summing up “Scheming AIs” (Section 5)

Joe_Carlsmith · 9 Dec 2023 15:48 UTC

9 points

1 comment · 10 min read

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)

Joe_Carlsmith · 11 Dec 2023 16:30 UTC

7 points

1 comment · 19 min read