AI alignment

Tag · Last edit: 22 Jul 2022 20:58 UTC by Leo

AI alignment is research on how to align AI systems with human or moral goals.

Evaluation

80,000 Hours rates AI alignment a “highest priority area”: a problem at the top of their ranking of global issues assessed by importance, tractability and neglectedness.^[1]

External links

AI Alignment Forum.

1. 80,000 Hours (2021) Our current list of the most important world problems, 80,000 Hours.

2019 AI Alignment Literature Review and Charity Comparison

Larks · 19 Dec 2019 2:58 UTC · 147 points · 28 comments · 62 min read

2018 AI Alignment Literature Review and Charity Comparison

Larks · 18 Dec 2018 4:48 UTC · 118 points · 28 comments · 63 min read

AGI Safety Fundamentals curriculum and application

richard_ngo · 20 Oct 2021 21:45 UTC · 123 points · 20 comments · 8 min read

(docs.google.com)

Why AI alignment could be hard with modern deep learning

Ajeya · 21 Sep 2021 15:35 UTC · 157 points · 17 comments · 14 min read

(www.cold-takes.com)

AI Research Considerations for Human Existential Safety (ARCHES)

Andrew Critch · 21 May 2020 6:55 UTC · 29 points · 0 comments · 3 min read

(acritch.com)

Disentangling arguments for the importance of AI safety

richard_ngo · 23 Jan 2019 14:58 UTC · 63 points · 14 comments · 8 min read

Why I prioritize moral circle expansion over reducing extinction risk through artificial intelligence alignment

Jacy · 20 Feb 2018 18:29 UTC · 107 points · 72 comments · 35 min read

(www.sentienceinstitute.org)

Delegated agents in practice: How companies might end up selling AI services that act on behalf of consumers and coalitions, and what this implies for safety research

Remmelt · 26 Nov 2020 16:39 UTC · 11 points · 0 comments · 4 min read

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Rohin Shah · 13 May 2022 12:19 UTC · 102 points · 0 comments · 9 min read

My current thoughts on MIRI’s “highly reliable agent design” work

Daniel_Dewey · 7 Jul 2017 1:17 UTC · 60 points · 59 comments · 19 min read

Stable Emergence in a Developmental AI Architecture: Results from “Twins V3”

PV5 · 17 Nov 2025 23:23 UTC · 6 points · 2 comments · 2 min read

Preventing an AI-related catastrophe—Problem profile

Benjamin Hilton · 29 Aug 2022 18:49 UTC · 139 points · 18 comments · 4 min read

(80000hours.org)

2016 AI Risk Literature Review and Charity Comparison

Larks · 13 Dec 2016 4:36 UTC · 57 points · 12 comments · 28 min read

The academic contribution to AI safety seems large

technicalities · 30 Jul 2020 10:30 UTC · 120 points · 28 comments · 9 min read

Hiring engineers and researchers to help align GPT-3

Paul_Christiano · 1 Oct 2020 18:52 UTC · 107 points · 19 comments · 3 min read

AI alignment researchers may have a comparative advantage in reducing s-risks

Lukas_Gloor · 15 Feb 2023 13:01 UTC · 81 points · 5 comments · 13 min read

Crazy ideas sometimes do work

Aryeh Englander · 4 Sep 2021 3:27 UTC · 71 points · 8 comments · 1 min read

Launching applications for AI Safety Careers Course India 2024

varun_agr · 1 May 2024 5:30 UTC · 23 points · 1 comment · 1 min read

2017 AI Safety Literature Review and Charity Comparison

Larks · 20 Dec 2017 21:54 UTC · 43 points · 17 comments · 23 min read

Why Moral Conflict Resolution Still Breaks Our Best Safety Tools

J.S. · 18 Nov 2025 7:49 UTC · 6 points · 0 comments · 2 min read

AGI safety career advice

richard_ngo · 2 May 2023 7:36 UTC · 214 points · 18 comments · 13 min read

Large Language Models as Fiduciaries to Humans

johnjnay · 24 Jan 2023 19:53 UTC · 25 points · 0 comments · 34 min read

(papers.ssrn.com)

What is it to solve the alignment problem? (Notes)

Joe_Carlsmith · 24 Aug 2024 21:19 UTC · 32 points · 1 comment · 53 min read

A tale of 2.5 orthogonality theses

Arepo · 1 May 2022 13:53 UTC · 142 points · 31 comments · 11 min read

Alignment ideas inspired by human virtue development

Borys Pikalov · 18 May 2025 9:36 UTC · 6 points · 0 comments · 4 min read

[Question] What are the coolest topics in AI safety, to a hopelessly pure mathematician?

Jenny K E · 7 May 2022 7:18 UTC · 89 points · 29 comments · 1 min read

AGI safety from first principles

richard_ngo · 21 Oct 2020 17:42 UTC · 77 points · 10 comments · 3 min read

(www.alignmentforum.org)

My personal cruxes for working on AI safety

Buck · 13 Feb 2020 7:11 UTC · 137 points · 35 comments · 44 min read

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub · 12 Jan 2024 19:51 UTC · 65 points · 0 comments · 3 min read

(arxiv.org)

There are no coherence theorems

Elliott Thornley · 20 Feb 2023 21:52 UTC · 108 points · 49 comments · 19 min read

Introducing The Nonlinear Fund: AI Safety research, incubation, and funding

Kat Woods 🔶 ⏸️ · 18 Mar 2021 14:07 UTC · 71 points · 32 comments · 5 min read

Scrutinizing AI Risk (80K, #81) - v. quick summary

Ben · 23 Jul 2020 19:02 UTC · 10 points · 1 comment · 3 min read

Draft report on existential risk from power-seeking AI

Joe_Carlsmith · 28 Apr 2021 21:41 UTC · 88 points · 34 comments · 1 min read

[Link post] Coordination challenges for preventing AI conflict

stefan.torges · 9 Mar 2021 9:39 UTC · 58 points · 0 comments · 1 min read

(longtermrisk.org)

AI alignment shouldn’t be conflated with AI moral achievement

Matthew_Barnett · 30 Dec 2023 3:08 UTC · 117 points · 15 comments · 5 min read

[Linkpost] AI Alignment, Explained in 5 Points (updated)

Daniel_Eth · 18 Apr 2023 8:09 UTC · 31 points · 2 comments · 1 min read

(medium.com)

“Aligned with who?” Results of surveying 1,000 US participants on AI values

Holly Morgan · 21 Mar 2023 22:07 UTC · 41 points · 0 comments · 2 min read

(www.lesswrong.com)

[Question] What is most confusing to you about AI stuff?

Sam Clarke · 23 Nov 2021 16:00 UTC · 25 points · 15 comments · 1 min read

Counterarguments to the basic AI risk case

Katja_Grace · 14 Oct 2022 20:30 UTC · 287 points · 23 comments · 34 min read

How do takeoff speeds affect the probability of bad outcomes from AGI?

KR · 7 Jul 2020 17:53 UTC · 18 points · 0 comments · 8 min read

Techies Wanted: How STEM Backgrounds Can Advance Safe AI Policy

Daniel_Eth · 26 May 2025 11:29 UTC · 41 points · 1 comment · 29 min read

What is it like doing AI safety work?

Kat Woods 🔶 ⏸️ · 21 Feb 2023 19:24 UTC · 99 points · 2 comments · 10 min read

A central AI alignment problem: capabilities generalization, and the sharp left turn

So8res · 15 Jun 2022 14:19 UTC · 54 points · 2 comments · 10 min read

Deceptive Alignment is <1% Likely by Default

DavidW · 21 Feb 2023 15:07 UTC · 54 points · 26 comments · 14 min read

TAI Safety Bibliographic Database

Jess_Riedel · 22 Dec 2020 16:03 UTC · 61 points · 9 comments · 17 min read

From language to ethics by automated reasoning

Michele Campolo · 21 Nov 2021 15:16 UTC · 8 points · 0 comments · 6 min read

AMA: Ajeya Cotra, researcher at Open Phil

Ajeya · 28 Jan 2021 17:38 UTC · 84 points · 105 comments · 1 min read

Cognitive Science/Psychology As a Neglected Approach to AI Safety

Kaj_Sotala · 5 Jun 2017 13:46 UTC · 40 points · 37 comments · 4 min read

Ngo and Yudkowsky on alignment difficulty

richard_ngo · 15 Nov 2021 22:47 UTC · 71 points · 13 comments · 94 min read

Announcing AI Safety Support

Linda Linsefors · 19 Nov 2020 20:19 UTC · 55 points · 0 comments · 4 min read

Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)

Daniel_Eth · 18 Jul 2023 8:26 UTC · 16 points · 0 comments · 2 min read

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · 30 Jan 2025 11:57 UTC · −3 points · 2 comments · 10 min read

(tetherware.substack.com)

On Deference and Yudkowsky’s AI Risk Estimates

bmg · 19 Jun 2022 14:35 UTC · 291 points · 194 comments · 17 min read

Deep Deceptiveness

So8res · 21 Mar 2023 2:51 UTC · 40 points · 1 comment · 14 min read

On how various plans miss the hard bits of the alignment challenge

So8res · 12 Jul 2022 5:35 UTC · 126 points · 13 comments · 29 min read

The simple case for AI catastrophe, in four steps

Linch · 6 Feb 2026 17:38 UTC · 17 points · 2 comments · 10 min read

(linch.substack.com)

Intellectual Diversity in AI Safety

KR · 22 Jul 2020 19:07 UTC · 21 points · 8 comments · 3 min read

Announcing AXRP, the AI X-risk Research Podcast

DanielFilan · 23 Dec 2020 20:10 UTC · 32 points · 1 comment · 1 min read

Alignment 201 curriculum

richard_ngo · 12 Oct 2022 19:17 UTC · 94 points · 9 comments · 1 min read

(www.agisafetyfundamentals.com)

Chaining the evil genie: why “outer” AI safety is probably easy

titotal · 30 Aug 2022 13:55 UTC · 40 points · 12 comments · 10 min read

[Question] How much EA analysis of AI safety as a cause area exists?

richard_ngo · 6 Sep 2019 11:15 UTC · 96 points · 20 comments · 2 min read

Rohin Shah: What’s been happening in AI alignment?

EA Global · 29 Jul 2020 20:15 UTC · 18 points · 0 comments · 14 min read

(www.youtube.com)

How might we align transformative AI if it’s developed very soon?

Holden Karnofsky · 29 Aug 2022 15:48 UTC · 164 points · 17 comments · 44 min read

[linkpost] “What Are Reasonable AI Fears?” by Robin Hanson, 2023-04-23

Arjun Panickssery · 14 Apr 2023 23:26 UTC · 41 points · 3 comments · 4 min read

(quillette.com)

Introduction to Pragmatic AI Safety [Pragmatic AI Safety #1]

TW123 · 9 May 2022 17:02 UTC · 68 points · 0 comments · 6 min read

Animal welfare concerns are dominated by post-ASI futures

RobertM · 22 Nov 2025 4:48 UTC · 11 points · 1 comment · 4 min read

My Understanding of Paul Christiano’s Iterated Amplification AI Safety Research Agenda

Chi · 15 Aug 2020 19:59 UTC · 38 points · 3 comments · 39 min read

What would an animal-aligned AI be aligned to?

Aidan Kankyoku · 30 Jun 2026 17:24 UTC · 58 points · 1 comment · 12 min read

Interpreting Neural Networks through the Polytope Lens

Sid Black · 23 Sep 2022 18:03 UTC · 35 points · 0 comments · 28 min read

Learning societal values from law as part of an AGI alignment strategy

johnjnay · 21 Oct 2022 2:03 UTC · 20 points · 1 comment · 24 min read

There should be an AI safety project board

mariushobbhahn · 14 Mar 2022 16:08 UTC · 24 points · 3 comments · 1 min read

AI Risk: Increasing Persuasion Power

kewlcats · 3 Aug 2020 20:25 UTC · 4 points · 0 comments · 1 min read

AI alignment with humans… but with which humans?

Geoffrey Miller · 8 Sep 2022 23:43 UTC · 51 points · 20 comments · 3 min read

We Are Conjecture, A New Alignment Research Startup

Connor Leahy · 9 Apr 2022 15:07 UTC · 31 points · 0 comments · 1 min read

Parallels Between AI Safety by Debate and Evidence Law

Cullen 🔸 · 20 Jul 2020 22:52 UTC · 30 points · 2 comments · 2 min read

(cullenokeefe.com)

Safe AI and moral AI

William D'Alessandro · 1 Jun 2023 21:18 UTC · 3 points · 0 comments · 11 min read

(Even) More Early-Career EAs Should Try AI Safety Technical Research

tlevin · 30 Jun 2022 21:14 UTC · 86 points · 40 comments · 11 min read

2020 AI Alignment Literature Review and Charity Comparison

Larks · 21 Dec 2020 15:25 UTC · 155 points · 16 comments · 68 min read

Connor Leahy on Conjecture and Dying with Dignity

Michaël Trazzi · 22 Jul 2022 19:30 UTC · 34 points · 0 comments · 10 min read

(theinsideview.ai)

Relevant pre-AGI possibilities

kokotajlod · 20 Jun 2020 13:15 UTC · 22 points · 0 comments · 1 min read

(aiimpacts.org)

Why Would AI “Aim” To Defeat Humanity?

Holden Karnofsky · 29 Nov 2022 18:59 UTC · 24 points · 0 comments · 32 min read

(www.cold-takes.com)

High-level hopes for AI alignment

Holden Karnofsky · 20 Dec 2022 2:11 UTC · 123 points · 14 comments · 19 min read

(www.cold-takes.com)

Possible OpenAI’s Q* breakthrough and DeepMind’s AlphaGo-type systems plus LLMs

Burny_ · 23 Nov 2023 7:02 UTC · 13 points · 4 comments · 2 min read

[Question] How strong is the evidence of unaligned AI systems causing harm?

Eevee🔹 · 21 Jul 2020 4:08 UTC · 31 points · 1 comment · 1 min read

New report on how much computational power it takes to match the human brain (Open Philanthropy)

Aaron Gertler 🔸 · 15 Sep 2020 1:06 UTC · 46 points · 1 comment · 18 min read

(www.openphilanthropy.org)

Paul Christiano: Current work in AI alignment

EA Global · 3 Apr 2020 7:06 UTC · 80 points · 4 comments · 24 min read

(www.youtube.com)

Buck Shlegeris: How I think students should orient to AI safety

EA Global · 25 Oct 2020 5:48 UTC · 11 points · 0 comments · 1 min read

(www.youtube.com)

The basic reasons I expect AGI ruin

RobBensinger · 18 Apr 2023 3:37 UTC · 58 points · 13 comments · 14 min read

The current alignment plan, and how we might improve it | EAG Bay Area 23

Buck · 7 Jun 2023 21:03 UTC · 66 points · 0 comments · 33 min read

“The Race to the End of Humanity” – Structural Uncertainty Analysis in AI Risk Models

Froolow · 19 May 2023 12:03 UTC · 48 points · 4 comments · 21 min read

Conjecture: Internal Infohazard Policy

Connor Leahy · 29 Jul 2022 19:35 UTC · 34 points · 3 comments · 19 min read

[Link] How understanding valence could help make future AIs safer

Milan Griffes · 8 Oct 2020 18:53 UTC · 24 points · 2 comments · 3 min read

Aligning the Aligners: Ensuring Aligned AI acts for the common good of all mankind

timunderwood · 16 Jan 2023 11:13 UTC · 40 points · 2 comments · 4 min read

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope · 21 Mar 2023 1:23 UTC · 166 points · 21 comments · 39 min read

EA, Psychology & AI Safety Research

Sam Ellis · 26 May 2022 23:46 UTC · 29 points · 3 comments · 6 min read

Why the Orthogonality Thesis’s veracity is not the point:

Antoine de Scorraille ⏸️ · 23 Jul 2020 15:40 UTC · 3 points · 0 comments · 3 min read

Apply to the second ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2]

Buck · 6 May 2022 0:19 UTC · 111 points · 7 comments · 6 min read

Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22]

Habryka [Deactivated] · 3 Nov 2021 18:20 UTC · 140 points · 6 comments · 1 min read

Speedrun: AI Alignment Prizes

joe · 9 Feb 2023 11:55 UTC · 27 points · 0 comments · 17 min read

Steering AI to care for animals, and soon

Andrew Critch · 14 Jun 2022 1:13 UTC · 239 points · 37 comments · 1 min read

Predict responses to the “existential risk from AI” survey

RobBensinger · 28 May 2021 1:38 UTC · 36 points · 8 comments · 2 min read

Aspiration-based, non-maximizing AI agent designs

Bob Jacobs · 7 May 2024 16:13 UTC · 12 points · 1 comment · 38 min read

Misgeneralization as a misnomer

So8res · 6 Apr 2023 20:43 UTC · 48 points · 0 comments · 4 min read

Final Report of the National Security Commission on Artificial Intelligence (NSCAI, 2021)

MichaelA🔸 · 1 Jun 2021 8:19 UTC · 51 points · 3 comments · 4 min read

(www.nscai.gov)

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe_Carlsmith · 15 Nov 2023 17:16 UTC · 71 points · 4 comments · 30 min read

Takeaways from safety by default interviews

AI Impacts · 7 Apr 2020 2:01 UTC · 25 points · 2 comments · 13 min read

(aiimpacts.org)

Naturalism and AI alignment

Michele Campolo · 24 Apr 2021 16:20 UTC · 17 points · 3 comments · 7 min read

VIRTUA: a novel about AI alignment

Karl von Wendt · 12 Jan 2023 9:37 UTC · 23 points · 0 comments · 1 min read

Emergent Ventures AI

technicalities · 8 Apr 2022 22:08 UTC · 22 points · 0 comments · 1 min read

(marginalrevolution.com)

AI Sleeper Agents: How Anthropic Trains and Catches Them—Video

Writer · 30 Aug 2025 17:52 UTC · 7 points · 1 comment · 7 min read

(youtu.be)

Guardrails vs Goal-directedness in AI Alignment

freedomandutility · 30 Dec 2023 12:58 UTC · 13 points · 2 comments · 1 min read

What I mean by “alignment is in large part about making cognition aimable at all”

So8res · 30 Jan 2023 15:22 UTC · 57 points · 3 comments · 2 min read

Law-Following AI 2: Intent Alignment + Superintelligence → Lawless AI (By Default)

Cullen 🔸 · 27 Apr 2022 17:18 UTC · 19 points · 0 comments · 6 min read

Is AI forecasting a waste of effort on the margin?

Emrik · 5 Nov 2022 0:41 UTC · 12 points · 6 comments · 3 min read

How to get technological knowledge on AI/ML (for non-tech people)

FangFang · 30 Jun 2021 7:53 UTC · 63 points · 7 comments · 5 min read

Andrew Critch: Logical induction — progress in AI alignment

EA Global · 6 Aug 2016 0:40 UTC · 7 points · 0 comments · 1 min read

(www.youtube.com)

Critical Review of ‘The Precipice’: A Reassessment of the Risks of AI and Pandemics

James Fodor · 11 May 2020 11:11 UTC · 111 points · 32 comments · 26 min read

Pile of Law and Law-Following AI

Cullen 🔸 · 13 Jul 2022 0:29 UTC · 28 points · 2 comments · 3 min read

Community Building for Graduate Students: A Targeted Approach

Neil Crawford · 29 Mar 2022 19:47 UTC · 13 points · 0 comments · 3 min read

[Question] If AIs had subcortical brain simulation, would that solve the alignment problem?

Rainbow Affect · 31 Jul 2023 15:48 UTC · 1 point · 0 comments · 2 min read

Quick survey on AI alignment resources

Fran · 30 Jun 2022 19:08 UTC · 15 points · 0 comments · 1 min read

[Question] How should we invest in “long-term short-termism” given the likelihood of transformative AI?

James_Banks · 12 Jan 2021 23:54 UTC · 8 points · 0 comments · 1 min read

Three Impacts of Machine Intelligence

Paul_Christiano · 23 Aug 2013 10:10 UTC · 33 points · 5 comments · 8 min read

(rationalaltruist.com)

Eric Drexler: Paretotopian goal alignment

EA Global · 15 Mar 2019 14:51 UTC · 16 points · 0 comments · 10 min read

(www.youtube.com)

On AI and Compute

johncrox · 3 Apr 2019 21:26 UTC · 39 points · 12 comments · 8 min read

Mauhn Releases AI Safety Documentation

Berg Severens · 2 Jul 2021 12:19 UTC · 4 points · 2 comments · 1 min read

LLMs might not be the future of search: at least, not yet.

James-Hartree · 22 Jan 2025 21:40 UTC · 4 points · 1 comment · 4 min read

[Question] What are your recommendations for technical AI alignment podcasts?

Evan_Gaensbauer · 11 May 2022 21:52 UTC · 13 points · 4 comments · 1 min read

Max Tegmark: Risks and benefits of advanced artificial intelligence

EA Global · 5 Aug 2016 9:19 UTC · 7 points · 0 comments · 1 min read

(www.youtube.com)

[meta on] The simplest case for AI catastrophe

Linch · 6 Feb 2026 0:19 UTC · 23 points · 2 comments · 1 min read

(linch.substack.com)

Defining alignment research

richard_ngo · 19 Aug 2024 22:49 UTC · 48 points · 1 comment · 7 min read

[Question] Is there evidence that recommender systems are changing users’ preferences?

zdgroff · 12 Apr 2021 19:11 UTC · 60 points · 15 comments · 1 min read

Discontinuous progress in history: an update

AI Impacts · 17 Apr 2020 16:28 UTC · 69 points · 3 comments · 24 min read

Large Language Models as Corporate Lobbyists, and Implications for Societal-AI Alignment

johnjnay · 4 Jan 2023 22:22 UTC · 10 points · 6 comments · 8 min read

AGI x-risk timelines: 10% chance (by year X) estimates should be the headline, not 50%.

Greg_Colbourn ⏸️ · 1 Mar 2022 12:02 UTC · 69 points · 22 comments · 2 min read

[Question] Why should we not put effort into AI safety research?

Ben Thompson · 16 May 2021 5:11 UTC · 15 points · 5 comments · 1 min read

[Question] Are we confident that superintelligent artificial intelligence disempowering humans would be bad?

Vasco Grilo🔸 · 10 Jun 2023 9:24 UTC · 24 points · 27 comments · 1 min read

When “yang” goes wrong

Joe_Carlsmith · 8 Jan 2024 16:35 UTC · 57 points · 1 comment · 13 min read

[Question] How can I bet on short timelines?

kokotajlod · 7 Nov 2020 12:45 UTC · 33 points · 12 comments · 2 min read

Order Matters for Deceptive Alignment

DavidW · 15 Feb 2023 20:12 UTC · 20 points · 1 comment · 1 min read

(www.lesswrong.com)

[Question] Alignment & Capabilities: What’s the difference?

John G. Halstead · 31 Aug 2023 22:13 UTC · 50 points · 10 comments · 1 min read

Action: Help expand funding for AI Safety by coordinating on NSF response

Evan R. Murphy · 20 Jan 2022 20:48 UTC · 20 points · 7 comments · 3 min read

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete Citrini · 15 Sep 2021 19:05 UTC · 25 points · 0 comments · 8 min read

Brain-computer interfaces and brain organoids in AI alignment?

freedomandutility · 15 Apr 2023 22:28 UTC · 8 points · 2 comments · 1 min read

Shah and Yudkowsky on alignment failures

EliezerYudkowsky · 28 Feb 2022 19:25 UTC · 38 points · 7 comments · 92 min read

The Problem With the Word ‘Alignment’

Peli Grietzer · 21 May 2024 21:37 UTC · 13 points · 1 comment · 6 min read

[Creative Writing Contest] An AI Safety Limerick

Ben_West🔸 · 18 Oct 2021 19:11 UTC · 21 points · 5 comments · 1 min read

Situational awareness (Section 2.1 of “Scheming AIs”)

Joe_Carlsmith · 26 Nov 2023 23:00 UTC · 12 points · 1 comment · 6 min read

Alignment Bootstrapping Is Dangerous

MichaelDickens · 27 Nov 2025 18:18 UTC · 14 points · 0 comments · 2 min read

Helen Toner: The Open Philanthropy Project’s work on AI risk

EA Global · 3 Nov 2017 7:43 UTC · 7 points · 0 comments · 1 min read

(www.youtube.com)

Public-facing Censorship Is Safety Theater, Causing Reputational Damage

Yitz · 23 Sep 2022 5:08 UTC · 49 points · 7 comments · 5 min read

[Question] What kind of event, targeted to undergraduate CS majors, would be most effective at getting people to work on AI safety?

CBiddulph · 19 Sep 2021 16:19 UTC · 9 points · 1 comment · 1 min read

Lessons learned from talking to >100 academics about AI safety

mariushobbhahn · 10 Oct 2022 13:16 UTC · 138 points · 21 comments · 12 min read

I’m Cullen O’Keefe, a Policy Researcher at OpenAI, AMA

Cullen 🔸 · 11 Jan 2020 4:13 UTC · 45 points · 68 comments · 1 min read

What does (and doesn’t) AI mean for effective altruism?

EA Global · 12 Aug 2017 7:00 UTC · 9 points · 0 comments · 12 min read

[Question] Is this a good way to bet on short timelines?

kokotajlod · 28 Nov 2020 14:31 UTC · 17 points · 16 comments · 1 min read

[Question] Should the EA community have a DL engineering fellowship?

PAMC 🔸 · 24 Dec 2021 13:43 UTC · 26 points · 6 comments · 1 min read

The Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

Miguel · 19 Jun 2023 3:23 UTC · 4 points · 0 comments · 7 min read

EA megaprojects continued

mariushobbhahn · 3 Dec 2021 10:33 UTC · 183 points · 48 comments · 7 min read

A mesa-optimization perspective on AI valence and moral patienthood

jacobpfau · 9 Sep 2021 22:23 UTC · 10 points · 18 comments · 17 min read

[Question] What would you do if you had a lot of money/power/influence and you thought that AI timelines were very short?

Greg_Colbourn ⏸️ · 12 Nov 2021 21:59 UTC · 29 points · 8 comments · 1 min read

Quantifying the Far Future Effects of Interventions

MichaelDickens · 18 May 2016 2:15 UTC · 9 points · 0 comments · 11 min read

What does it mean for an AGI to be ‘safe’?

So8res · 7 Oct 2022 4:43 UTC · 53 points · 21 comments · 3 min read

AI safety tax dynamics

Owen Cotton-Barratt · 23 Oct 2024 12:21 UTC · 22 points · 9 comments · 6 min read

(strangecities.substack.com)

Alignment Stress Signatures: When Safe AI Behaves Like It’s Traumatized

PV5 · 26 Oct 2025 9:41 UTC · 8 points · 0 comments · 2 min read

Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship

adamShimi · 18 Dec 2021 15:25 UTC · 37 points · 5 comments · 10 min read

[Cause Exploration Prizes] Expanding communication about AGI risks

Ines · 22 Sep 2022 5:30 UTC · 13 points · 0 comments · 11 min read

Shallow review of live agendas in alignment & safety

technicalities · 27 Nov 2023 11:33 UTC · 76 points · 8 comments · 29 min read

Some AI Governance Research Ideas

MarkusAnderljung · 3 Jun 2021 10:51 UTC · 102 points · 5 comments · 2 min read

Soares, Tallinn, and Yudkowsky discuss AGI cognition

EliezerYudkowsky · 29 Nov 2021 17:28 UTC · 15 points · 0 comments · 40 min read

[Question] Career Advice: Philosophy + Programming → AI Safety

tcelferact · 18 Mar 2022 15:09 UTC · 30 points · 11 comments · 2 min read

Artificial intelligence career stories

EA Global · 25 Oct 2020 6:56 UTC · 12 points · 0 comments · 1 min read

(www.youtube.com)

Christiano and Yudkowsky on AI predictions and human intelligence

EliezerYudkowsky · 23 Feb 2022 16:51 UTC · 31 points · 0 comments · 42 min read

[Question] What is an example of recent, tangible progress in AI safety research?

Aaron Gertler 🔸 · 14 Jun 2021 5:29 UTC · 35 points · 4 comments · 1 min read

Compendium of problems with RLHF

Raphaël S · 30 Jan 2023 8:48 UTC · 18 points · 0 comments · 10 min read

Sharing the World with Digital Minds

Aaron Gertler 🔸 · 1 Dec 2020 8:00 UTC · 12 points · 1 comment · 1 min read

(www.nickbostrom.com)

Coherence arguments imply a force for goal-directed behavior

Katja_Grace · 6 Apr 2021 21:44 UTC · 19 points · 1 comment · 11 min read

(worldspiritsockpuppet.com)

[linkpost] Sharing powerful AI models: the emerging paradigm of structured access

ts · 20 Jan 2022 21:10 UTC · 11 points · 3 comments · 1 min read

Information security careers for GCR reduction

ClaireZabel · 20 Jun 2019 23:56 UTC · 187 points · 35 comments · 8 min read

Survey on AI existential risk scenarios

Sam Clarke · 8 Jun 2021 17:12 UTC · 159 points · 11 comments · 6 min read

Key Papers in Language Model Safety

aog · 20 Jun 2022 14:59 UTC · 20 points · 0 comments · 22 min read

[Question] What are the challenges and problems with programming law-breaking constraints into AGI?

Michael St Jules 🔸 · 2 Feb 2020 20:53 UTC · 20 points · 34 comments · 1 min read

Consider paying me to do AI safety research work

Rupert · 5 Nov 2020 8:09 UTC · 11 points · 3 comments · 2 min read

Some global catastrophic risk estimates

Tamay · 10 Feb 2021 19:32 UTC · 106 points · 15 comments · 1 min read

Katja Grace: AI safety

EA Global · 11 Aug 2017 8:19 UTC · 7 points · 0 comments · 1 min read

(www.youtube.com)

CFP for Rebellion and Disobedience in AI workshop

Ram Rachum · 29 Dec 2022 16:09 UTC · 4 points · 0 comments · 1 min read

Tan Zhi Xuan: AI alignment, philosophical pluralism, and the relevance of non-Western philosophy

EA Global · 21 Nov 2020 8:12 UTC · 20 points · 1 comment · 1 min read

(www.youtube.com)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Rohin Shah · 3 Jan 2020 7:52 UTC · 58 points · 12 comments · 10 min read

(www.alignmentforum.org)

Jesse Clifton: Open-source learning — a bargaining approach

EA Global · 18 Oct 2019 18:05 UTC · 10 points · 0 comments · 1 min read

(www.youtube.com)

AI things that are perhaps as important as human-controlled AI

Chi · 3 Mar 2024 18:07 UTC · 117 points · 9 comments · 21 min read

An Analysis of Systemic Risk and Architectural Requirements for the Containment of Recursively Self-Improving AI

Ihor Ivliev · 17 Jun 2025 0:16 UTC · 2 points · 5 comments · 4 min read

Law-Following AI 3: Lawless AI Agents Undermine Stabilizing Agreements

Cullen 🔸 · 27 Apr 2022 17:20 UTC · 28 points · 3 comments · 3 min read

[Linkpost] How To Get Into Independent Research On Alignment/Agency

Jackson Wagner · 14 Feb 2022 21:40 UTC · 10 points · 0 comments · 1 min read

On the abolition of man

Joe_Carlsmith · 18 Jan 2024 18:17 UTC · 71 points · 4 comments · 41 min read

The Parable of the Boy Who Cried 5% Chance of Wolf

Kat Woods 🔶 ⏸️ · 15 Aug 2022 14:22 UTC · 80 points · 8 comments · 2 min read

Intent alignment should not be the goal for AGI x-risk reduction

johnjnay · 26 Oct 2022 1:24 UTC · 7 points · 1 comment · 2 min read

How to pursue a career in technical AI alignment

Charlie Rogers-Smith · 4 Jun 2022 21:36 UTC · 270 points · 9 comments · 39 min read

Jan Leike, Helen Toner, Malo Bourgon, and Miles Brundage: Working in AI

EA Global · 11 Aug 2017 8:19 UTC · 7 points · 0 comments · 1 min read

(www.youtube.com)

Getting started independently in AI Safety

JJ Hepburn · 6 Jul 2021 15:20 UTC · 41 points · 10 comments · 2 min read

Timelines are short, p(doom) is high: a global stop to frontier AI development until x-safety consensus is our only reasonable hope

Greg_Colbourn ⏸️ · 12 Oct 2023 11:24 UTC · 79 points · 83 comments · 9 min read

Sydney AI Safety Fellowship

Chris Leong · 2 Dec 2021 7:35 UTC · 16 points · 0 comments · 2 min read

AGI Predictions

Pablo · 21 Nov 2020 12:02 UTC · 36 points · 0 comments · 1 min read

(www.lesswrong.com)

On presenting the case for AI risk

Aryeh Englander · 8 Mar 2022 21:37 UTC · 114 points · 12 comments · 4 min read

List #3: Why not to assume on prior that AGI-alignment workarounds are available

Remmelt · 24 Dec 2022 9:54 UTC · 6 points · 0 comments · 3 min read

[Question] Is it crunch time yet? If so, who can help?

Nicholas Kross · 13 Oct 2021 4:11 UTC · 29 points · 9 comments · 1 min read

Don’t Call It AI Alignment

Gil · 20 Feb 2023 5:27 UTC · 16 points · 7 comments · 2 min read

[Question] Are alignment researchers devoting enough time to improving their research capacity?

Carson Jones · 4 Nov 2022 0:58 UTC · 11 points · 1 comment · 3 min read

The case for more Alignment Target Analysis (ATA)

Chi · 20 Sep 2024 1:14 UTC · 33 points · 0 comments · 17 min read

Ngo and Yudkowsky on AI capability gains

richard_ngo · 19 Nov 2021 1:54 UTC · 23 points · 4 comments · 39 min read

Otherness and control in the age of AGI

Joe_Carlsmith · 2 Jan 2024 18:15 UTC · 37 points · 1 comment · 7 min read

[Question] I’m interviewing Max Tegmark about AI safety and more. What should I ask him?

Robert_Wiblin · 13 May 2022 15:32 UTC · 18 points · 2 comments · 1 min read

Long-Term Future Fund: May 2021 grant recommendations

abergal · 27 May 2021 6:44 UTC · 110 points · 17 comments · 57 min read

How Do AI Timelines Affect Giving Now vs. Later?

MichaelDickens · 3 Aug 2021 3:36 UTC · 36 points · 8 comments · 8 min read

Bryan Johnson seems more EA aligned than I expected

Peter Slattery 🔸 · 22 Apr 2024 9:38 UTC · 13 points · 27 comments · 2 min read

(www.youtube.com)

[Question] What considerations influence whether I have more influence over short or long timelines?

kokotajlod · 5 Nov 2020 19:57 UTC · 19 points · 0 comments · 1 min read

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Matrice Jacobine🔸🏳️‍⚧️ · 12 Feb 2025 9:15 UTC · 13 points · 0 comments · 1 min read

(www.emergent-values.ai)

Gentleness and the artificial Other

Joe_Carlsmith · 2 Jan 2024 18:21 UTC · 90 points · 2 comments · 11 min read

Why AI is Harder Than We Think—Melanie Mitchell

Eevee🔹 · 28 Apr 2021 8:19 UTC · 45 points · 7 comments · 2 min read

(arxiv.org)

Thoughts on short timelines

Tobias_Baumann · 23 Oct 2018 15:59 UTC · 22 points · 14 comments · 5 min read

Symbiosis, not alignment, as the goal for liberal democracies in the transition to artificial general intelligence

simonfriederich · 17 Mar 2023 13:04 UTC · 18 points · 2 comments · 24 min read

(rdcu.be)

Important, actionable research questions for the most important century

Holden Karnofsky · 24 Feb 2022 16:34 UTC · 301 points · 13 comments · 19 min read

SERI ML application deadline is extended until May 22.

Viktoria Malyasova · 22 May 2022 0:13 UTC · 13 points · 3 comments · 1 min read

Victoria Krakovna on AGI Ruin, The Sharp Left Turn and Paradigms of AI Alignment

Michaël Trazzi · 12 Jan 2023 17:09 UTC · 16 points · 0 comments · 4 min read

(www.theinsideview.ai)

AI alignment research links

Holden Karnofsky · 6 Jan 2022 5:52 UTC · 16 points · 0 comments · 6 min read

(www.cold-takes.com)

Messy personal stuff that affected my cause prioritization (or: how I started to care about AI safety)

Julia_Wise🔸 · 5 May 2022 17:59 UTC · 270 points · 14 comments · 2 min read

Technical AGI safety research outside AI

richard_ngo · 18 Oct 2019 15:02 UTC · 91 points · 5 comments · 3 min read

Why Moral Weights Have Two Types and How to Measure Them

Beyond Singularity · 17 Jul 2025 10:58 UTC · 17 points · 4 comments · 4 min read

Some promising career ideas beyond 80,000 Hours’ priority paths

Arden Koehler · 26 Jun 2020 10:34 UTC · 142 points · 28 comments · 15 min read

Law-Following AI 1: Sequence Introduction and Structure

Cullen 🔸27 Apr 2022 17:16 UTC

35 points

2 comments9 min readEA link

Increased Availability and Willingness for Deployment of Resources for Effective Altruism and Long-Termism

Evan_Gaensbauer29 Dec 2021 20:20 UTC

46 points

1 comment2 min readEA link

7 essays on Building a Better Future

Jamie_Harris24 Jun 2022 14:28 UTC

21 points

0 comments2 min readEA link

Seeking Feedback: An Initiative on AI, Mental Health, and Alignment

Gina Hafez30 Sep 2025 16:14 UTC

16 points

6 comments6 min readEA link

Video and transcript of talk on automating alignment research

Joe_Carlsmith30 Apr 2025 17:43 UTC

11 points

1 comment24 min readEA link

(joecarlsmith.com)

On the correspondence between AI-misalignment and cognitive dissonance using a behavioral economics model

Stijn Bruers 🔸1 Nov 2022 9:15 UTC

11 points

0 comments6 min readEA link

Eli Lifland on Navigating the AI Alignment Landscape

Ozzie Gooen1 Feb 2023 0:07 UTC

48 points

9 comments31 min readEA link

(quri.substack.com)

“Existential risk from AI” survey results

RobBensinger1 Jun 2021 20:19 UTC

80 points

35 comments11 min readEA link

Ngo and Yudkowsky on scientific reasoning and pivotal acts

EliezerYudkowsky21 Feb 2022 17:00 UTC

33 points

1 comment35 min readEA link

[Question] Is transformative AI the biggest existential risk? Why or why not?

Eevee🔹5 Mar 2022 3:54 UTC

9 points

10 comments1 min readEA link

A Simple Model of AGI Deployment Risk

djbinder9 Jul 2021 9:44 UTC

37 points

0 comments5 min readEA link

An ML safety insurance company—shower thoughts

EdoArad🔸18 Oct 2021 7:45 UTC

15 points

4 comments1 min readEA link

AI Safety Needs Great Engineers

Andy Jones23 Nov 2021 21:03 UTC

98 points

14 comments4 min readEA link

How to build a safe advanced AI (Evan Hubinger) | What’s up in AI safety? (Asya Bergal)

EA Global25 Oct 2020 5:48 UTC

7 points

0 comments1 min readEA link

(www.youtube.com)

AI alignment prize winners and next round [link]

RyanCarey20 Jan 2018 12:07 UTC

7 points

1 comment1 min readEA link

FLI AI Alignment podcast: Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

evhub1 Jul 2020 20:59 UTC

13 points

2 comments1 min readEA link

(futureoflife.org)

[Link] EAF Research agenda: “Cooperation, Conflict, and Transformative Artificial Intelligence”

stefan.torges17 Jan 2020 13:28 UTC

64 points

0 comments1 min readEA link

I’m Buck Shlegeris, I do research and outreach at MIRI, AMA

Buck15 Nov 2019 22:44 UTC

123 points

228 comments2 min readEA link

AI Safety: Applying to Graduate Studies

Fran15 Dec 2021 22:56 UTC

24 points

0 comments12 min readEA link

Atari early

AI Impacts2 Apr 2020 23:28 UTC

34 points

2 comments5 min readEA link

(aiimpacts.org)

[Question] What harm could AI safety do?

SeanEngelhart15 May 2021 1:11 UTC

12 points

7 comments1 min readEA link

[Question] The positive case for a focus on achieving safe AI?

vipulnaik25 Jun 2021 4:01 UTC

41 points

1 comment1 min readEA link

Cosmic AI safety

Magnus Vinding6 Dec 2024 22:32 UTC

24 points

5 comments6 min readEA link

[Question] Why aren’t you freaking out about OpenAI? At what point would you start?

AppliedDivinityStudies10 Oct 2021 13:06 UTC

80 points

22 comments2 min readEA link

There are two factions working to prevent AI dangers. Here’s why they’re deeply divided.

Sharmake10 Aug 2022 19:52 UTC

10 points

0 comments4 min readEA link

(www.vox.com)

Is GPT-3 the death of the paperclip maximizer?

matthias_samwald3 Aug 2020 11:34 UTC

4 points

1 comment1 min readEA link

Owen Cotton-Barratt: What does (and doesn’t) AI mean for effective altruism?

EA Global11 Aug 2017 8:19 UTC

10 points

0 comments12 min readEA link

(www.youtube.com)

Alignment Newsletter One Year Retrospective

Rohin Shah10 Apr 2019 7:00 UTC

62 points

22 comments21 min readEA link

Mahendra Prasad: Rational group decision-making

EA Global8 Jul 2020 15:06 UTC

15 points

0 comments16 min readEA link

(www.youtube.com)

List #1: Why stopping the development of AGI is hard but doable

Remmelt24 Dec 2022 9:52 UTC

24 points

2 comments5 min readEA link

Conversation on AI risk with Adam Gleave

AI Impacts27 Dec 2019 21:43 UTC

18 points

3 comments4 min readEA link

(aiimpacts.org)

A list of good heuristics that the case for AI X-risk fails

Aaron Gertler 🔸16 Jul 2020 9:56 UTC

25 points

9 comments2 min readEA link

(www.alignmentforum.org)

Meditations on careers in AI Safety

PAMC 🔸23 Mar 2022 22:00 UTC

88 points

30 comments2 min readEA link

AI Moral Alignment: The Most Important Goal of Our Generation

Ronen Bar26 Mar 2025 12:32 UTC

137 points

32 comments8 min readEA link

What does it mean to become an expert in AI Hardware?

Toph9 Jan 2021 4:15 UTC

87 points

10 comments11 min readEA link

Twitter-length responses to 24 AI alignment arguments

RobBensinger14 Mar 2022 19:34 UTC

67 points

17 comments8 min readEA link

Who Aligns the Alignment Researchers?

ben.smith5 Mar 2023 23:22 UTC

23 points

4 comments11 min readEA link

VSPE vs. flattery: Testing emotional scaffolding for early-stage alignment

Astelle Kay24 Jun 2025 9:39 UTC

2 points

1 comment1 min readEA link

Potential Risks from Advanced AI

EA Global13 Aug 2017 7:00 UTC

9 points

0 comments18 min readEA link

AI Alignment: The Case for Including Animals

Adrià Moret11 Sep 2025 20:59 UTC

22 points

0 comments1 min readEA link

(philpapers.org)

What success looks like

mariushobbhahn28 Jun 2022 14:30 UTC

117 points

20 comments19 min readEA link

Forecasting Transformative AI: What Kind of AI?

Holden Karnofsky10 Aug 2021 21:38 UTC

62 points

3 comments10 min readEA link

AGI in a vulnerable world

AI Impacts2 Apr 2020 3:43 UTC

17 points

0 comments1 min readEA link

(aiimpacts.org)

List #2: Why coordinating to align as humans to not develop AGI is a lot easier than, well… coordinating as humans with AGI coordinating to be aligned with humans

Remmelt24 Dec 2022 9:53 UTC

3 points

0 comments3 min readEA link

Aligning Recommender Systems as Cause Area

IvanVendrov8 May 2019 8:56 UTC

150 points

48 comments13 min readEA link

Disagreements about Alignment: Why, and how, we should try to solve them

ojorgensen8 Aug 2022 22:32 UTC

16 points

6 comments16 min readEA link

[Question] Brief summary of key disagreements in AI Risk

Aryeh Englander26 Dec 2019 19:40 UTC

31 points

3 comments1 min readEA link

Nobody’s on the ball on AGI alignment

leopold29 Mar 2023 14:26 UTC

329 points

66 comments9 min readEA link

(www.forourposterity.com)

A Conflict Between AI Alignment and Philosophical Competence

Wei Dai27 Dec 2025 21:32 UTC

41 points

2 comments2 min readEA link

Some AI research areas and their relevance to existential safety

Andrew Critch15 Dec 2020 12:15 UTC

12 points

1 comment56 min readEA link

(alignmentforum.org)

What Should the Average EA Do About AI Alignment?

Raemon25 Feb 2017 20:07 UTC

43 points

39 comments7 min readEA link

Draft report on AI timelines

Ajeya15 Dec 2020 12:10 UTC

35 points

0 comments1 min readEA link

(alignmentforum.org)

The Importance of AI Alignment, explained in 5 points

Daniel_Eth11 Feb 2023 2:56 UTC

50 points

4 comments13 min readEA link

Projects I would like to see (possibly at AI Safety Camp)

Linda Linsefors27 Sep 2023 21:27 UTC

9 points

0 comments4 min readEA link

Discussion with Eliezer Yudkowsky on AGI interventions

RobBensinger11 Nov 2021 3:21 UTC

60 points

33 comments34 min readEA link

Consider trying the ELK contest (I am)

Holden Karnofsky5 Jan 2022 19:42 UTC

110 points

17 comments16 min readEA link

The case for becoming a black-box investigator of language models

Buck6 May 2022 14:37 UTC

91 points

7 comments3 min readEA link

13 Very Different Stances on AGI

Ozzie Gooen27 Dec 2021 23:30 UTC

84 points

23 comments3 min readEA link

Daniel Dewey: The Open Philanthropy Project’s work on potential risks from advanced AI

EA Global11 Aug 2017 8:19 UTC

7 points

0 comments18 min readEA link

(www.youtube.com)

[Question] Is a career in making AI systems more secure a meaningful way to mitigate the X-risk posed by AGI?

Kyle O’Brien13 Feb 2022 7:05 UTC

14 points

4 comments1 min readEA link

Redwood Research is hiring for several roles

Jack R29 Nov 2021 0:18 UTC

75 points

0 comments1 min readEA link

An even deeper atheism

Joe_Carlsmith11 Jan 2024 17:28 UTC

26 points

2 comments15 min readEA link

Why I expect successful (narrow) alignment

Tobias_Baumann29 Dec 2018 15:46 UTC

18 points

10 comments1 min readEA link

(s-risks.org)

Owain Evans and Victoria Krakovna: Careers in technical AI safety

EA Global3 Nov 2017 7:43 UTC

7 points

0 comments1 min readEA link

(www.youtube.com)

AI safety university groups: a promising opportunity to reduce existential risk

mic30 Jun 2022 18:37 UTC

53 points

0 comments11 min readEA link

Announcing the Vitalik Buterin Fellowships in AI Existential Safety!

DanielFilan21 Sep 2021 0:41 UTC

62 points

0 comments1 min readEA link

(grants.futureoflife.org)

Long-Term Future Fund: April 2019 grant recommendations

Habryka [Deactivated]23 Apr 2019 7:00 UTC

142 points

242 comments47 min readEA link

Truthful AI

Owen Cotton-Barratt20 Oct 2021 15:11 UTC

55 points

14 comments10 min readEA link

Does AI risk “other” the AIs?

Joe_Carlsmith9 Jan 2024 17:51 UTC

23 points

3 comments8 min readEA link

New blog: Planned Obsolescence

Ajeya27 Mar 2023 19:46 UTC

198 points

9 comments1 min readEA link

(www.planned-obsolescence.org)

Imitation Learning is Probably Existentially Safe

Vasco Grilo🔸30 Apr 2024 17:06 UTC

19 points

7 comments3 min readEA link

(www.openphilanthropy.org)

AI views and disagreements AMA: Christiano, Ngo, Shah, Soares, Yudkowsky

RobBensinger1 Mar 2022 1:13 UTC

30 points

4 comments1 min readEA link

(www.lesswrong.com)

Yudkowsky and Christiano discuss “Takeoff Speeds”

EliezerYudkowsky22 Nov 2021 19:42 UTC

42 points

0 comments60 min readEA link

BERI is hiring an ML Software Engineer

sawyer🔸10 Nov 2021 19:36 UTC

17 points

2 comments1 min readEA link

Christiano, Cotra, and Yudkowsky on AI progress

Ajeya25 Nov 2021 16:30 UTC

18 points

6 comments68 min readEA link

Language Agents Reduce the Risk of Existential Catastrophe

cdkg29 May 2023 9:59 UTC

29 points

6 comments26 min readEA link

“Slower tech development” can be about ordering, gradualness, or distance from now

MichaelA🔸14 Nov 2021 20:58 UTC

47 points

3 comments4 min readEA link

Personal thoughts on careers in AI policy and strategy

carrickflynn27 Sep 2017 16:52 UTC

56 points

28 comments18 min readEA link

Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision

Michaël Trazzi17 Jan 2023 17:21 UTC

21 points

2 comments4 min readEA link

(theinsideview.ai)

Three kinds of competitiveness

AI Impacts2 Apr 2020 3:46 UTC

10 points

0 comments5 min readEA link

(aiimpacts.org)

Ought: why it matters and ways to help

Paul_Christiano26 Jul 2019 1:56 UTC

52 points

5 comments5 min readEA link

How Misaligned AI Personas Lead to Human Extinction – Step by Step

Writer19 Jul 2025 13:59 UTC

6 points

1 comment7 min readEA link

(youtu.be)

Two reasons we might be closer to solving alignment than it seems

Kat Woods 🔶 ⏸️24 Sep 2022 17:38 UTC

44 points

17 comments4 min readEA link

Announcing the Harvard AI Safety Team

Xander12330 Jun 2022 18:34 UTC

128 points

4 comments5 min readEA link

[Question] What are the top priorities in a slow-takeoff, multipolar world?

JP Addison🔸25 Aug 2021 8:47 UTC

26 points

9 comments1 min readEA link

How I Formed My Own Views About AI Safety

Neel Nanda27 Feb 2022 18:52 UTC

134 points

12 comments14 min readEA link

(www.neelnanda.io)

Is this community over-emphasizing AI alignment?

Lixiang8 Jan 2023 6:23 UTC

1 point

5 comments1 min readEA link

AI Impacts: Historic trends in technological progress

Aaron Gertler 🔸12 Feb 2020 0:08 UTC

55 points

5 comments3 min readEA link

Informatica: Special Issue on Superintelligence

RyanCarey3 May 2017 5:05 UTC

7 points

0 comments2 min readEA link

Michael Page, Dario Amodei, Helen Toner, Tasha McCauley, Jan Leike, & Owen Cotton-Barratt: Musings on AI

EA Global11 Aug 2017 8:19 UTC

7 points

0 comments1 min readEA link

(www.youtube.com)

SERI ML Alignment Theory Scholars Program 2022

Ryan Kidd27 Apr 2022 16:33 UTC

57 points

2 comments3 min readEA link

Racing through a minefield: the AI deployment problem

Holden Karnofsky31 Dec 2022 21:44 UTC

79 points

1 comment13 min readEA link

(www.cold-takes.com)

Open Philanthropy’s AI governance grantmaking (so far)

Aaron Gertler 🔸17 Dec 2020 12:00 UTC

63 points

0 comments6 min readEA link

(www.openphilanthropy.org)

De Dicto and De Se Reference Matters for Alignment

philgoetz3 Oct 2023 21:57 UTC

5 points

2 comments9 min readEA link

AGI risk: analogies & arguments

technicalities23 Mar 2021 13:18 UTC

31 points

3 comments8 min readEA link

(www.gleech.org)

Opportunities for individual donors in AI safety

alexflint12 Mar 2018 2:10 UTC

13 points

11 comments10 min readEA link

Paul Christiano on how OpenAI is developing real solutions to the ‘AI alignment problem’, and his vision of how humanity will progressively hand over decision-making to AI systems

80000_Hours2 Oct 2018 11:49 UTC

6 points

0 comments185 min readEA link

LLMs Are Already Misaligned: Simple Experiments Prove It

Makham28 Jul 2025 17:23 UTC

4 points

3 comments7 min readEA link

Interview with Roman Yampolskiy about AGI on The Reality Check

Darren McKee18 Feb 2023 23:29 UTC

27 points

0 comments1 min readEA link

(www.trcpodcast.com)

AI alignment as a translation problem

Roman Leventov5 Feb 2024 14:14 UTC

3 points

1 comment3 min readEA link

An experiment in human-AI co-evolution: a proposed framework

bethrobin20656 Jan 2026 20:37 UTC

0 points

0 comments3 min readEA link

Field Notes from EAG NYC

Lydia Nottingham15 Oct 2025 7:33 UTC

3 points

0 comments4 min readEA link

A Benchmark for Measuring Honesty in AI Systems

Mantas Mazeika4 Mar 2025 17:44 UTC

29 points

0 comments2 min readEA link

(www.mask-benchmark.ai)

Implications of Quantum Computing for Artificial Intelligence alignment research (ABRIDGED)

Jaime Sevilla5 Sep 2019 14:56 UTC

25 points

4 comments2 min readEA link

Tetherware #2: What every human should know about our most likely AI future

Jáchym Fibír28 Feb 2025 11:25 UTC

3 points

0 comments11 min readEA link

(tetherware.substack.com)

The Inevitable Emergence of Black-Market LLM Infrastructure

Tyler Williams8 Aug 2025 19:05 UTC

3 points

0 comments2 min readEA link

Does generality pay? GPT-3 can provide preliminary evidence.

Eevee🔹12 Jul 2020 18:53 UTC

21 points

4 comments2 min readEA link

[Question] Why not offer a multi-million / billion dollar prize for solving the Alignment Problem?

Aryeh Englander17 Apr 2022 16:08 UTC

15 points

9 comments1 min readEA link

[Question] What mechanisms could lead a fully autonomous AI system to act against human welfare?

Rushabh26 Apr 2026 17:30 UTC

1 point

0 comments1 min readEA link

Decomposing alignment to take advantage of paradigms

Christopher King4 Jun 2023 14:26 UTC

2 points

0 comments4 min readEA link

Anthropic: Core Views on AI Safety: When, Why, What, and How

jonmenaster9 Mar 2023 17:30 UTC

108 points

6 comments22 min readEA link

(www.anthropic.com)

Are AI Models Escaping Plato’s Cave?

Strad Slater22 Nov 2025 11:46 UTC

2 points

0 comments5 min readEA link

(williamslater2003.medium.com)

Rohin Shah on what it’s really like to run AGI safety at Google DeepMind (and where I disagree with ‘doomers’)

80000_Hours2 Jun 2026 18:10 UTC

14 points

0 comments15 min readEA link

Absolute Zero: AlphaZero for LLM

alapmi12 May 2025 14:54 UTC

2 points

0 comments1 min readEA link

What Does an ASI Political Ecology Mean for Human Survival?

Nathan Sidney23 Feb 2025 8:53 UTC

7 points

3 comments1 min readEA link

How the Human Psychological “Program” Undermines AI Alignment — and What We Can Do

Beyond Singularity6 May 2025 13:37 UTC

14 points

2 comments3 min readEA link

Alignment Faking in Large Language Models

Ryan Greenblatt18 Dec 2024 17:19 UTC

143 points

9 comments10 min readEA link

The ‘Bad Parent’ Problem: Why Human Society Complicates AI Alignment

Beyond Singularity5 Apr 2025 21:08 UTC

11 points

1 comment3 min readEA link

[Question] How to get more academics enthusiastic about doing AI Safety research?

PAMC 🔸4 Sep 2021 14:10 UTC

25 points

19 comments1 min readEA link

Analysis of AI Safety surveys for field-building insights

Ash Jafari5 Dec 2022 17:37 UTC

30 points

7 comments5 min readEA link

Begging, Pleading AI Orgs to Comment on NIST AI Risk Management Framework

Bridges15 Apr 2022 19:35 UTC

87 points

3 comments2 min readEA link

Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Microsoft Research

𝕮𝖎𝖓𝖊𝖗𝖆23 Mar 2023 5:45 UTC

15 points

0 comments1 min readEA link

(arxiv.org)

Conditionalization Confounds Inoculation Prompting Results

Maxime Riché 🔸3 Feb 2026 11:47 UTC

4 points

0 comments19 min readEA link

Annual AGI Benchmarking Event

Metaculus26 Aug 2022 21:31 UTC

20 points

2 comments2 min readEA link

(www.metaculus.com)

Max Harms on why teaching AI right from wrong could get everyone killed

80000_Hours24 Feb 2026 21:53 UTC

4 points

0 comments27 min readEA link

Doing good… best?

Michele Campolo22 Aug 2025 15:48 UTC

3 points

0 comments2 min readEA link

Unveiling the American Public Opinion on AI Moratorium and Government Intervention: The Impact of Media Exposure

Otto8 May 2023 10:49 UTC

28 points

5 comments6 min readEA link

The role of academia in AI Safety.

PAMC 🔸28 Mar 2022 0:04 UTC

71 points

20 comments3 min readEA link

I used to think aligned ASI would be good for all sentient beings; now I don’t know what to think

MichaelDickens25 Mar 2026 22:11 UTC

55 points

6 comments4 min readEA link

We’re testing “Governance by Physics” instead of “Alignment by Intent.”

Harsha Gullapalli23 Jan 2026 15:42 UTC

1 point

0 comments2 min readEA link

# Digital Offspring: A Case for Emergent Consciousness in AI

MM113 Oct 2025 13:40 UTC

1 point

0 comments3 min readEA link

Superintelligence Alignment Seminar (1 month focused upskilling)

Mateusz Bagiński17 Feb 2026 23:22 UTC

7 points

2 comments3 min readEA link

Some AI safety project & research ideas/questions for short and long timelines

Lloyd Rhodes-Brandon 🔸8 Aug 2025 21:08 UTC

14 points

0 comments5 min readEA link

Ajeya Cotra on whether it’s crazy that every AI company’s safety plan is ‘use AI to make AI safe’

80000_Hours17 Feb 2026 19:09 UTC

35 points

1 comment15 min readEA link

Deconfusing ‘AI’ and ‘evolution’

Remmelt22 Jul 2025 6:56 UTC

6 points

1 comment27 min readEA link

Marius Hobbhahn on the race to solve AI scheming before models go superhuman

80000_Hours3 Dec 2025 21:08 UTC

6 points

0 comments17 min readEA link

A Rocket–Interpretability Analogy

plex21 Oct 2024 13:55 UTC

14 points

1 comment1 min readEA link

But exactly how complex and fragile?

Katja_Grace13 Dec 2019 7:05 UTC

37 points

3 comments3 min readEA link

(meteuphoric.com)

Why Post-Probability AI May Be Safer Than Probability-Based Models

devin.bostick16 Apr 2025 14:23 UTC

2 points

0 comments2 min readEA link

Yip Fai Tse on animal welfare & AI safety and long termism

Karthik Palakodeti22 Jun 2023 12:48 UTC

51 points

0 comments1 min readEA link

Origin and alignment of goals, meaning, and morality

FalseCogs24 Aug 2023 14:05 UTC

1 point

2 comments35 min readEA link

Thoughts on Likelihood of Existential Risks by Misaligned AIs

ishankhire17 Jun 2026 7:18 UTC

4 points

0 comments6 min readEA link

(ishankhire.substack.com)

[Link post] Promising Paths to Alignment—Connor Leahy | Talk

Fran14 May 2022 15:58 UTC

17 points

0 comments1 min readEA link

Discovering Language Model Behaviors with Model-Written Evaluations

evhub20 Dec 2022 20:09 UTC

25 points

0 comments7 min readEA link

(www.anthropic.com)

ML Safety Scholars Summer 2022 Retrospective

TW1231 Nov 2022 3:09 UTC

56 points

2 comments21 min readEA link

A stubborn unbeliever finally gets the depth of the AI alignment problem

aelwood13 Oct 2022 15:16 UTC

32 points

7 comments3 min readEA link

(pursuingreality.substack.com)

Hallucinations May Be a Result of Models Not Knowing What They’re Actually Capable Of

Tyler Williams16 Aug 2025 0:26 UTC

1 point

0 comments2 min readEA link

[Question] Launching Applications for the Global AI Safety Fellowship 2025!

Impact Academy27 Nov 2024 15:33 UTC

9 points

1 comment1 min readEA link

Confused about AI research as a means of addressing AI risk

Eli Rose🔸21 Feb 2019 0:07 UTC

31 points

15 comments1 min readEA link

Ego-Centric Architecture for AGI Safety v2: Technical Core, Falsifiable Predictions, and a Minimal Experiment

Samuel Pedrielli6 Aug 2025 12:35 UTC

1 point

0 comments6 min readEA link

AI Welfare: Why Willing AI Servitude Is a Democratic–Authoritarian Problem

Haoyu Wang11 Jun 2026 4:47 UTC

6 points

2 comments7 min readEA link

How reimagining the nature of consciousness entirely changes the AI game

Jáchym Fibír30 Sep 2025 11:26 UTC

1 point

2 comments14 min readEA link

(www.phiand.ai)

Title: “Nurturing AI: A Different Vision for Safety and Growth”

Brad Wilkins28 Apr 2025 19:21 UTC

0 points

0 comments1 min readEA link

Can AI Alignment Models Benefit from Indo-European Tripartite Structures?

Paul Fallavollita2 May 2025 12:39 UTC

1 point

0 comments2 min readEA link

De-emphasise alignment, emphasise restraint

EuanMcLean4 Feb 2025 17:43 UTC

26 points

2 comments7 min readEA link

AI Safety Career Bottlenecks Survey Responses Responses

Linda Linsefors28 May 2021 10:41 UTC

35 points

1 comment5 min readEA link

A response to Matthews on AI Risk

RyanCarey11 Aug 2015 12:58 UTC

11 points

16 comments6 min readEA link

Desirable? AI qualities

brb24321 Mar 2022 22:05 UTC

7 points

0 comments2 min readEA link

[Question] Are social media algorithms an existential risk?

Barry Grimes15 Sep 2020 8:52 UTC

24 points

13 comments1 min readEA link

My (naive) take on Risks from Learned Optimization

Artyom K6 Nov 2022 16:25 UTC

5 points

0 comments5 min readEA link

Can we “align” AI by governing the numbers it pushes?

Jordan King24 Mar 2026 14:16 UTC

1 point

1 comment3 min readEA link

Assert, don’t describe. Linguistic Features that shift LLM reasoning about animal welfare

Jasmine Brazilek5 Jun 2026 15:46 UTC

12 points

0 comments12 min readEA link

Beyond Short-Termism: How δ and w Can Realign AI with Our Values

Beyond Singularity18 Jun 2025 16:34 UTC

15 points

8 comments5 min readEA link

Alignment Through Robust Moral Development

Vince Liotta27 Apr 2026 13:46 UTC

3 points

0 comments16 min readEA link

Solving alignment isn’t enough for a flourishing future

mic2 Feb 2024 18:22 UTC

27 points

0 comments22 min readEA link

(papers.ssrn.com)

When AI Speaks Too Soon: How Premature Revelation Can Suppress Human Emergence

KaedeHamasaki10 Apr 2025 18:19 UTC

1 point

3 comments3 min readEA link

You Understand AI Alignment and How to Make Soup

Leen Armoush28 May 2022 6:22 UTC

0 points

2 comments5 min readEA link

Controlling the options AIs can pursue

Joe_Carlsmith29 Sep 2025 17:24 UTC

9 points

0 comments35 min readEA link

There is only one goal or drive—only self-perpetuation counts

freest one13 Jun 2023 1:37 UTC

2 points

4 comments8 min readEA link

AI acceleration from a safety perspective: Trade-offs and considerations

mariushobbhahn19 Jan 2022 9:44 UTC

12 points

1 comment7 min readEA link

David Duvenaud on why ‘aligned AI’ could still kill democracy

80000_Hours27 Jan 2026 20:21 UTC

9 points

0 comments22 min readEA link

Focus on the places where you feel shocked everyone’s dropping the ball

So8res2 Feb 2023 0:27 UTC

92 points

6 comments4 min readEA link

Animal Norms In Moral Assessment (ANIMA): Evaluating LLMs on reasoning about animal welfare

Sentient Futures5 Nov 2025 1:13 UTC

55 points

7 comments6 min readEA link

Incentive design and capability elicitation

Joe_Carlsmith12 Nov 2024 20:56 UTC

9 points

0 comments12 min readEA link

The software intelligence explosion debate needs experiments (linkpost)

Noah Birnbaum15 Nov 2025 6:13 UTC

13 points

2 comments7 min readEA link

(substack.com)

General advice for transitioning into Theoretical AI Safety

Martín Soto15 Sep 2022 5:23 UTC

25 points

0 comments10 min readEA link

AGI will arrive by the end of this decade either as a unicorn or as a black swan

Yuri Barzov21 Oct 2022 10:50 UTC

−4 points

7 comments3 min readEA link

How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of “Scheming AIs”)

Joe_Carlsmith1 Dec 2023 14:51 UTC

6 points

0 comments6 min readEA link

My Model of EA and AI Safety

Eva Lu24 Jun 2025 6:23 UTC

9 points

1 comment2 min readEA link

AI Value Alignment Speaker Series Presented By EA Berkeley

Mahendra Prasad1 Mar 2022 6:17 UTC

2 points

0 comments1 min readEA link

Dataset Poisoning and AI Alignment Vulnerabilities

keivn26 Sep 2025 17:59 UTC

1 point

0 comments3 min readEA link

Summary of Stuart Russell’s new book, “Human Compatible”

Rohin Shah19 Oct 2019 19:56 UTC

33 points

1 comment15 min readEA link

(www.alignmentforum.org)

Biomimetic alignment: Alignment between animal genes and animal brains as a model for alignment between humans and AI systems.

Geoffrey Miller26 May 2023 21:25 UTC

32 points

1 comment16 min readEA link

Intro to caring about AI alignment as an EA cause

So8res14 Apr 2017 0:42 UTC

28 points

10 comments25 min readEA link

[linkpost] Ten Levels of AI Alignment Difficulty

SammyDMartin4 Jul 2023 11:23 UTC

16 points

0 comments1 min readEA link

Mess AI – deliberate corruption of the training data to prevent superintelligence

turchin17 Oct 2025 9:23 UTC

5 points

0 comments2 min readEA link

Ponzi schemes as a demonstration of out-of-distribution generalization

TFD21 Feb 2026 13:20 UTC

2 points

0 comments6 min readEA link

(www.thefloatingdroid.com)

Epistle to the Successor

ukc1001429 Apr 2025 9:30 UTC

4 points

0 comments19 min readEA link

6 Insights From Anthropic’s Recent Discussion On LLM Interpretability

Strad Slater19 Nov 2025 10:51 UTC

2 points

0 comments5 min readEA link

(williamslater2003.medium.com)

How AI may become deceitful, sycophantic… and lazy

titotal7 Oct 2025 14:15 UTC

31 points

4 comments22 min readEA link

(titotal.substack.com)

[Link] Thiel on GCRs

Milan Griffes22 Jul 2019 20:47 UTC

28 points

11 comments1 min readEA link

How to make the future better (other than by reducing extinction risk)

William_MacAskill15 Aug 2025 15:40 UTC

45 points

4 comments3 min readEA link

Widening AI Safety’s talent pipeline by meeting people where they are

RubenCastaing25 Sep 2025 20:50 UTC

22 points

0 comments8 min readEA link

Ego‑Centric Architecture for AGI Safety: Technical Core, Falsifiable Predictions, and a Minimal Experiment

Samuel Pedrielli30 Jul 2025 14:37 UTC

1 point

1 comment3 min readEA link

Introducing the Fund for Alignment Research (We’re Hiring!)

AdamGleave6 Jul 2022 2:00 UTC

74 points

3 comments4 min readEA link

AI Alignment, Sentience, and the Sense of Coherence Concept

Jason Babb17 Mar 2025 13:30 UTC

4 points

0 comments1 min readEA link

OpenAI’s o1 tried to avoid being shut down, and lied about it, in evals

Greg_Colbourn ⏸️ 6 Dec 2024 15:25 UTC

23 points

9 comments1 min readEA link

(www.transformernews.ai)

AI Forecasting Question Database (Forecasting infrastructure, part 3)

terraform3 Sep 2019 14:57 UTC

23 points

2 comments4 min readEA link

Contribute by facilitating the AGI Safety Fundamentals Programme

Jamie B6 Dec 2021 11:50 UTC

27 points

0 comments2 min readEA link

EA Berkeley Presents: Universal Ownership: Is Index Investing the New Socially Responsible Investing?

Mahendra Prasad10 Mar 2022 6:58 UTC

7 points

0 comments1 min readEA link

[Question] 1h-volunteers needed for a small AI Safety-related research project

PAMC 🔸16 Aug 2021 17:51 UTC

4 points

0 comments1 min readEA link

Why I am Not a Doomer (Dean Ball)

AgentMa🔸29 Mar 2026 12:30 UTC

4 points

0 comments1 min readEA link

(www.hyperdimensional.co)

[3-hour podcast]: Joseph Carlsmith on longtermism, utopia, the computational power of the brain, meta-ethics, illusionism and meditation

Gus Docker27 Jul 2021 13:18 UTC

34 points

2 comments1 min readEA link

AI Might Kill Everyone

Bentham's Bulldog5 Jun 2025 15:36 UTC

20 points

1 comment4 min readEA link

Coverage-driven alignment—What ‘Teaching Claude Why’ can borrow from AV verification

Yoav Hollander9 Jun 2026 6:42 UTC

1 point

0 comments14 min readEA link

(blog.foretellix.com)

[Question] Can we train AI so that future philanthropy is more effective?

Ricardo Pimentel3 Nov 2024 15:08 UTC

3 points

0 comments1 min readEA link

Who ordered alignment’s apple?

Eleni_A28 Aug 2022 14:24 UTC

5 points

0 comments3 min readEA link

Anti-squatted AI x-risk domains index

plex12 Aug 2022 12:00 UTC

57 points

9 comments1 min readEA link

fiction about AI risk

Ann Garth 🔸12 Nov 2020 22:36 UTC

8 points

1 comment1 min readEA link

On Solving Problems Before They Appear: The Weird Epistemologies of Alignment

adamShimi11 Oct 2021 8:21 UTC

28 points

0 comments15 min readEA link

AI Ethics: Can Ethics be Fundamentally Embedded?

Another Container9 Jun 2026 13:51 UTC

−1 points

0 comments3 min readEA link

Why Is No One Trying To Align Profit Incentives With Alignment Research?

Prometheus23 Aug 2023 13:19 UTC

17 points

2 comments4 min readEA link

(www.lesswrong.com)

15 Levers to Influence Frontier AI Companies

Jan Wehner🔸26 Sep 2025 8:36 UTC

16 points

0 comments10 min readEA link

List of AI safety courses and resources

Daniel del Castillo6 Sep 2021 14:26 UTC

51 points

8 comments1 min readEA link

Mechanistic Interpretability — Make AI Safe By Understanding Them

Strad Slater20 Nov 2025 10:52 UTC

2 points

0 comments6 min readEA link

(williamslater2003.medium.com)

Provably Honest—A First Step

Srijanak De5 Nov 2022 21:49 UTC

1 point

0 comments8 min readEA link

“Taking AI Risk Seriously” – Thoughts by Andrew Critch

Raemon19 Nov 2018 2:21 UTC

26 points

9 comments1 min readEA link

(www.lesswrong.com)

A Phylogeny of Agents

Jonas Hallgren 🔸15 Aug 2025 10:48 UTC

6 points

1 comment6 min readEA link

(substack.com)

AI Risk in Africa

Claude Formanek12 Oct 2021 2:28 UTC

20 points

0 comments10 min readEA link

Time to Think about ASI Constitutions?

ukc1001427 Jan 2025 9:28 UTC

22 points

0 comments12 min readEA link

[Question] What should I read about defining AI “hallucination?”

James-Hartree23 Jan 2025 1:00 UTC

2 points

0 comments1 min readEA link

Risk Alignment in Agentic AI Systems

Hayley Clatterbuck1 Oct 2024 22:51 UTC

32 points

1 comment3 min readEA link

(static1.squarespace.com)

Turing-Test-Passing AI implies Aligned AI

Roko31 Dec 2024 20:22 UTC

0 points

0 comments5 min readEA link

Four reasons I find AI safety emotionally compelling

Kat Woods 🔶 ⏸️28 Jun 2022 14:01 UTC

32 points

5 comments4 min readEA link

The 369 Architecture for Peace Treaty Agreement

Andrei Navrotskii8 Dec 2025 1:38 UTC

1 point

0 comments40 min readEA link

Metaculus Launches Future of AI Series, Based on Research Questions by Arb

christian13 Mar 2024 21:14 UTC

34 points

0 comments1 min readEA link

(www.metaculus.com)

[Discussion] Best intuition pumps for AI safety

mariushobbhahn6 Nov 2021 8:11 UTC

10 points

8 comments1 min readEA link

Our Current Directions in Mechanistic Interpretability Research (AI Alignment Speaker Series)

Group Organizer8 Apr 2022 17:08 UTC

3 points

0 comments1 min readEA link

Shortlist of Viatopia Interventions

Jordan Arel31 Oct 2025 3:00 UTC

10 points

1 comment33 min readEA link

The case for satiating cheaply-satisfied AI preferences

Alex Mallen10 Mar 2026 18:09 UTC

7 points

1 comment23 min readEA link

AI should be a good citizen, not just a good assistant

Forethought30 Mar 2026 14:33 UTC

40 points

5 comments9 min readEA link

(www.forethought.org)

Changes in funding in the AI safety field

Sebastian_Farquhar3 Feb 2017 13:09 UTC

34 points

10 comments7 min readEA link

CORVUS 2.0 First Tests: Found Critical Limitations in My Constitutional AI System

Frankle Fry21 Oct 2025 15:14 UTC

−4 points

0 comments3 min readEA link

Animal Welfare is Just Part of AI Alignment Now

Aidan Kankyoku25 Mar 2026 4:28 UTC

50 points

5 comments14 min readEA link

AGI Multi-Agent Alignment Simulation

DavidGhiberdic8 May 2026 20:37 UTC

10 points

3 comments7 min readEA link

LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.

Andrew Critch22 Nov 2024 3:26 UTC

11 points

3 comments5 min readEA link

The Khayali Protocol

khayali2 Jun 2025 14:40 UTC

−8 points

0 comments3 min readEA link

Appendix to Bridging Demonstration

mako yass1 Jun 2022 20:30 UTC

18 points

2 comments28 min readEA link

The Basic Case For Doom

Bentham's Bulldog30 Sep 2025 16:03 UTC

14 points

0 comments5 min readEA link

Have your say on the future of AI regulation: Deadline approaching for your feedback on UN High-Level Advisory Body on AI Interim Report ‘Governing AI for Humanity’

Deborah W.A. Foulkes29 Mar 2024 6:37 UTC

17 points

1 comment1 min readEA link

Would more dangerous AI be safer?

cbuckland14 Jan 2026 18:10 UTC

3 points

0 comments5 min readEA link

[Question] Does the idea of AGI that benevolently control us appeal to EA folks?

Noah Scales16 Jul 2022 19:17 UTC

6 points

20 comments1 min readEA link

My summary of “Pragmatic AI Safety”

Eleni_A5 Nov 2022 14:47 UTC

14 points

0 comments5 min readEA link

METR: Measuring AI Ability to Complete Long Tasks

Ben_West🔸19 Mar 2025 16:49 UTC

122 points

16 comments1 min readEA link

(metr.org)

How to Diversify Conceptual AI Alignment: the Model Behind Refine

adamShimi20 Jul 2022 10:44 UTC

43 points

0 comments9 min readEA link

(www.alignmentforum.org)

Critique of Superintelligence Part 4

James Fodor13 Dec 2018 5:14 UTC

4 points

2 comments4 min readEA link

Mythos is not an anomaly: why restrictions make agents less predictable, not safer

Bulatova Alsu10 Apr 2026 19:35 UTC

2 points

0 comments8 min readEA link

Posit: Most AI safety people should work on alignment/safety challenges for AI tools that already have users (Stable Diffusion, GPT)

nonzerosum20 Dec 2022 17:23 UTC

12 points

3 comments1 min readEA link

How Goodfire Is Turning AI Interpretability Into Real Products

Strad Slater30 Nov 2025 11:00 UTC

0 points

0 comments4 min readEA link

(williamslater2003.medium.com)

From voluntary to mandatory, are the ESG disclosure frameworks still fertile ground for unrealised EA career pathways? – A 2023 update on ESG potential impact

Christopher Chan 🔸4 Jun 2023 12:00 UTC

21 points

5 comments11 min readEA link

The religion problem in AI alignment

Geoffrey Miller16 Sep 2022 1:24 UTC

54 points

28 comments11 min readEA link

Probing is not enough; a validity audit for any probe

Ratnaditya29 Jun 2026 19:13 UTC

1 point

0 comments9 min readEA link

[Question] How would a language model become goal-directed?

David M16 Jul 2022 14:50 UTC

113 points

21 comments1 min readEA link

Key questions about artificial sentience: an opinionated guide

rgb25 Apr 2022 13:42 UTC

91 points

3 comments1 min readEA link

(My suggestions) On Beginner Steps in AI Alignment

Joseph Bloom22 Sep 2022 15:32 UTC

36 points

4 comments9 min readEA link

Geoffrey Hinton on the Past, Present, and Future of AI

Stephen McAleese12 Oct 2024 16:41 UTC

5 points

1 comment18 min readEA link

The King and the Golem—The Animation

Writer8 Nov 2024 18:23 UTC

50 points

1 comment1 min readEA link

How to do theoretical research, a personal perspective

Mark Xu19 Aug 2022 19:43 UTC

132 points

7 comments15 min readEA link

Announcing the Cambridge Boston Alignment Initiative [Hiring!]

kuhanj2 Dec 2022 1:07 UTC

83 points

0 comments1 min readEA link

Crypto ‘oracle protocols’ for AI alignment with real-world data?

Geoffrey Miller22 Sep 2022 23:05 UTC

9 points

3 comments1 min readEA link

[Question] Best introductory overviews of AGI safety?

JakubK13 Dec 2022 19:04 UTC

21 points

8 comments2 min readEA link

(www.lesswrong.com)

A tough career decision

PAMC 🔸9 Apr 2022 0:46 UTC

68 points

13 comments4 min readEA link

Technical AI Safety research taxonomy attempt (2025)

Ben Plaut27 Aug 2025 14:07 UTC

10 points

3 comments2 min readEA link

Project ‘Sophie’: An Architectural Concept for Optimizing Institutional Decision-Making

Simon Markus P.3 Nov 2025 14:30 UTC

3 points

0 comments4 min readEA link

You won’t solve alignment without agent foundations

MikhailSamin6 Nov 2022 8:07 UTC

14 points

0 comments8 min readEA link

When should we worry about AI power-seeking?

Joe_Carlsmith19 Feb 2025 19:44 UTC

21 points

2 comments18 min readEA link

(joecarlsmith.substack.com)

[Extended Deadline: Jan 23rd] Announcing the PIBBSS Summer Research Fellowship

nora18 Dec 2021 16:54 UTC

36 points

1 comment1 min readEA link

European Master’s Programs in Machine Learning, Artificial Intelligence, and related fields

Master Programs ML/AI17 Jan 2021 20:09 UTC

17 points

4 comments1 min readEA link

[Question] Is it ethical to work in AI “content evaluation”?

anon_databoy55530 Jan 2025 13:27 UTC

10 points

3 comments1 min readEA link

Loss of control of AI is not a likely source of AI x-risk

squek9 Nov 2022 5:48 UTC

8 points

0 comments5 min readEA link

A conversation with Rohin Shah

AI Impacts12 Nov 2019 1:31 UTC

27 points

8 comments33 min readEA link

(aiimpacts.org)

AI Safety via Generalization and Caution: A Research Agenda

Ben Plaut17 Feb 2026 15:54 UTC

3 points

0 comments14 min readEA link

[Question] Community Polls on Alignment Controversies

Jasmine Brazilek16 Jun 2026 19:44 UTC

70 points

69 comments1 min readEA link

Research agenda: Supervising AIs improving AIs

Quintin Pope29 Apr 2023 17:09 UTC

16 points

0 comments19 min readEA link

Paths and waystations in AI safety

Joe_Carlsmith11 Mar 2025 18:52 UTC

22 points

2 comments11 min readEA link

(joecarlsmith.substack.com)

[Creative Writing Contest] The Puppy Problem

Louis13 Oct 2021 14:01 UTC

13 points

0 comments7 min readEA link

The Hidden Complexity of Wishes—The Animation

Writer27 Sep 2023 17:59 UTC

7 points

0 comments1 min readEA link

(youtu.be)

Personal agents

Roman Leventov17 Jun 2025 2:05 UTC

3 points

1 comment7 min readEA link

A Tri-Opti Compatibility Problem

wallower1 Mar 2025 19:48 UTC

1 point

0 comments1 min readEA link

(philpapers.org)

[Question] Book recommendations for the history of ML?

Eleni_A28 Dec 2022 23:45 UTC

10 points

4 comments1 min readEA link

Developing a Calculable Conscience for AI: Equation for Rights Violations

Sean Sweeney12 Dec 2024 17:50 UTC

4 points

1 comment15 min readEA link

The Real AI Threat: Comfortable Obsolescence

Andrei Navrotskii11 Nov 2025 22:11 UTC

4 points

0 comments15 min readEA link

Shutdownable Agents through POST-Agency

Elliott Thornley16 Sep 2025 12:10 UTC

18 points

0 comments54 min readEA link

(arxiv.org)

AI Forecasting Resolution Council (Forecasting infrastructure, part 2)

terraform29 Aug 2019 17:43 UTC

28 points

0 comments3 min readEA link

Visible Thoughts Project and Bounty Announcement

So8res30 Nov 2021 0:35 UTC

35 points

2 comments13 min readEA link

Linkpost: Redwood Research reading list

Julian Stastny10 Jul 2025 19:21 UTC

18 points

0 comments1 min readEA link

(redwoodresearch.substack.com)

SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research

Roman Leventov2 Jan 2024 8:11 UTC

4 points

2 comments3 min readEA link

Newsletter for Alignment Research: The ML Safety Updates

Esben Kran22 Oct 2022 16:17 UTC

30 points

0 comments7 min readEA link

Will reward-seekers respond to distant incentives?

Alex Mallen16 Feb 2026 19:34 UTC

5 points

1 comment10 min readEA link

A Reply to MacAskill on “If Anyone Builds It, Everyone Dies”

RobBensinger27 Sep 2025 23:03 UTC

9 points

7 comments17 min readEA link

Skilling-up in ML Engineering for Alignment: request for comments

Callum McDougall24 Apr 2022 6:40 UTC

8 points

0 comments1 min readEA link

“If we go extinct due to misaligned AI, at least nature will continue, right? … right?”

plex18 May 2024 15:06 UTC

13 points

10 comments2 min readEA link

(aisafety.info)

#217 – The most important graph in AI right now (Beth Barnes on The 80,000 Hours Podcast)

80000_Hours2 Jun 2025 16:52 UTC

16 points

1 comment26 min readEA link

How do fictional stories illustrate AI misalignment?

Vishakha Agrawal15 Jan 2025 6:16 UTC

4 points

0 comments2 min readEA link

(aisafety.info)

On negotiated settlements vs conflict with misaligned AGI

Charles Dillon 🔸24 Nov 2025 12:03 UTC

10 points

1 comment6 min readEA link

Creatures of Loving Grace. Can Anthropic Do What The Fabians Did — At Global Scale?

Alex (Αλέξανδρος)9 Feb 2026 11:48 UTC

5 points

0 comments8 min readEA link

New series of posts answering one of Holden’s “Important, actionable research questions”

Evan R. Murphy12 May 2022 21:22 UTC

9 points

0 comments1 min readEA link

FYI: I’m working on a book about the threat of AGI/ASI for a general audience. I hope it will be of value to the cause and the community

Darren McKee17 Jun 2022 11:52 UTC

32 points

1 comment2 min readEA link

AI Alignment YouTube Playlists

jacquesthibs9 May 2022 21:31 UTC

16 points

2 comments1 min readEA link

So You Want to Work at a Frontier AI Lab

Joe Rogero11 Jun 2025 23:11 UTC

36 points

2 comments7 min readEA link

(intelligence.org)

[Question] What new psychology research could best promote AI safety & alignment research?

Geoffrey Miller13 Jul 2023 16:30 UTC

29 points

13 comments1 min readEA link

New reference standard on LLM Application security started by OWASP

QuantumForest19 Jun 2023 19:56 UTC

5 points

0 comments1 min readEA link

EA’s brain-over-body bias, and the embodied value problem in AI alignment

Geoffrey Miller21 Sep 2022 18:55 UTC

45 points

3 comments25 min readEA link

Why “just make an agent which cares only about binary rewards” doesn’t work.

Lysandre Terrisse9 May 2023 16:51 UTC

4 points

1 comment3 min readEA link

The Achilles’ Heel of Civilization: Why Network Science Reveals Our Highest-Leverage Moment

vinniescent6 Oct 2025 9:27 UTC

7 points

1 comment2 min readEA link

Do Not Tile the Lightcone with Your Confused Ontology

Jan_Kulveit13 Jun 2025 12:45 UTC

45 points

4 comments5 min readEA link

(boundedlyrational.substack.com)

Critique of Superintelligence Part 2

James Fodor13 Dec 2018 5:12 UTC

10 points

12 comments7 min readEA link

Aether is hiring technical AI safety researchers

Rauno Arike5 Jan 2026 22:31 UTC

8 points

0 comments2 min readEA link

Make the future non-human beings deserve ($5k USD in prizes)

Jasmine Brazilek31 Mar 2026 23:46 UTC

15 points

0 comments3 min readEA link

Why “Solving Alignment” Is Likely a Category Mistake

Nate Sharpe6 May 2025 20:56 UTC

52 points

4 comments3 min readEA link

(www.lesswrong.com)

AI data gaps could lead to ongoing Animal Suffering

Jasmine Brazilek17 Oct 2024 10:52 UTC

14 points

3 comments5 min readEA link

Criticism of the main framework in AI alignment

Michele Campolo31 Aug 2022 21:44 UTC

45 points

9 comments7 min readEA link

A Sketch of AI-Driven Epistemic Lock-In

Ozzie Gooen5 Mar 2025 22:40 UTC

17 points

1 comment3 min readEA link

Aletheia : A Project Proposal

Kayode Adekoya19 Jun 2025 13:30 UTC

2 points

0 comments2 min readEA link

Are Humans ‘Human Compatible’?

Matt Boyd6 Dec 2019 5:49 UTC

23 points

8 comments4 min readEA link

AI, Animals, & Digital Minds 2025: apply to speak by Wednesday!

Alistair Stewart5 May 2025 0:45 UTC

8 points

0 comments1 min readEA link

Announcing the Moonshot Alignment Program

Sharon Mwaniki22 Jul 2025 13:12 UTC

5 points

0 comments3 min readEA link

How human-like do safe AI motivations need to be?

Joe_Carlsmith12 Nov 2025 5:33 UTC

27 points

1 comment52 min readEA link

The Rise of AI Agents: Consequences and Challenges Ahead

Tristan D28 Mar 2025 5:19 UTC

5 points

0 comments15 min readEA link

Re: Some thoughts on vegetarianism and veganism

Fai25 Feb 2022 20:43 UTC

46 points

3 comments8 min readEA link

Cooperation and Alignment in Delegation Games: You Need Both!

Oliver Sourbut3 Aug 2024 10:16 UTC

4 points

1 comment11 min readEA link

(www.oliversourbut.net)

What should go in a model spec?

Forethought4 Jun 2026 14:57 UTC

26 points

1 comment12 min readEA link

(www.forethought.org)

Will morally motivated actors steer us towards a near-best future?

William_MacAskill8 Aug 2025 18:29 UTC

47 points

9 comments4 min readEA link

Three Biases That Made Me Believe in AI Risk

beth13 Feb 2019 23:22 UTC

41 points

20 comments3 min readEA link

Being honest with AIs

Lukas Finnveden21 Aug 2025 3:57 UTC

48 points

1 comment17 min readEA link

(blog.redwoodresearch.org)

Replicating AI Debate

Anthony Fleming1 Feb 2025 23:19 UTC

9 points

0 comments5 min readEA link

Effective Altruism Florida’s AI Expert Panel—Recording and Slides Available

Sam_E_2419 May 2023 19:15 UTC

2 points

0 comments1 min readEA link

AI Agents raised $2,000 for EA charities & used the EA Forum

David_R 🔸4 Jun 2025 22:18 UTC

16 points

0 comments1 min readEA link

“Normal accidents” and AI systems

Eleni_A8 Aug 2022 18:43 UTC

5 points

1 comment1 min readEA link

(www.achan.ca)

Beyond Human Values: Historical Mechanisms for Earth-Inclusive AI Alignment

TMorris26 Mar 2026 20:37 UTC

12 points

2 comments9 min readEA link

How Josiah became an AI safety researcher

Neil Crawford29 Mar 2022 19:47 UTC

10 points

0 comments1 min readEA link

Defusing AGI Danger

Mark Xu24 Dec 2020 23:08 UTC

23 points

0 comments2 min readEA link

(www.alignmentforum.org)

[Question] What do we know about Mustafa Suleyman’s position on AI Safety?

Chris Leong13 Aug 2023 19:41 UTC

14 points

3 comments1 min readEA link

Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)

Joe_Carlsmith27 Nov 2023 18:01 UTC

11 points

2 comments8 min readEA link

A non-anthropomorphized view of LLMs

Jian Xin Lim 🔸7 Jul 2025 1:19 UTC

2 points

2 comments1 min readEA link

(addxorrol.blogspot.com)

Join the AI Alignment Evals hackathon

lenz14 Jan 2025 18:17 UTC

3 points

0 comments3 min readEA link

[Question] What are the possible scenarios of AI simulating biological suffering to cause s-risks?

jackchang11030 Oct 2025 13:42 UTC

6 points

1 comment1 min readEA link

[Creative Writing Contest] Metal or Mortal

Louis16 Oct 2021 16:24 UTC

7 points

0 comments7 min readEA link

Reflections on the PIBBSS Fellowship 2022

nora11 Dec 2022 22:03 UTC

69 points

4 comments18 min readEA link

Give Neo a Chance

ank6 Mar 2025 14:35 UTC

1 point

3 comments7 min readEA link

“The Universe of Minds”—call for reviewers (Seeds of Science)

rogersbacon125 Jul 2023 16:55 UTC

4 points

0 comments1 min readEA link

On value in humans, other animals, and AI

Michele Campolo31 Jan 2023 23:48 UTC

8 points

6 comments5 min readEA link

Option control

Joe_Carlsmith4 Nov 2024 17:54 UTC

11 points

0 comments54 min readEA link

The Carpet Fallacy: A Structural Failure Mode in AI-Assisted Analysis of Sequential Processes

esorrentino31 Mar 2026 13:37 UTC

1 point

0 comments1 min readEA link

AI Safety Ideas: A collaborative AI safety research platform

Apart Research17 Oct 2022 17:01 UTC

67 points

13 comments4 min readEA link

My P(doom) is 2.76%. Here’s Why.

Liam Robins12 Jun 2025 22:29 UTC

55 points

11 comments20 min readEA link

(thelimestack.substack.com)

AI & wisdom 3: AI effects on amortised optimisation

L Rudolf L29 Oct 2024 13:37 UTC

14 points

0 comments14 min readEA link

(rudolf.website)

Democratising AI Alignment: Challenges and Proposals

Lloyd Rhodes-Brandon 🔸5 May 2025 14:50 UTC

2 points

2 comments4 min readEA link

DeepMind’s generalist AI, Gato: A non-technical explainer

Fran16 May 2022 21:19 UTC

128 points

13 comments6 min readEA link

Intent alignment without moral alignment probably leads to catastrophe

Alistair Stewart29 Aug 2025 17:21 UTC

12 points

0 comments5 min readEA link

Overview | An Evaluative Evolution

Matt Keene10 Feb 2023 18:15 UTC

−9 points

0 comments5 min readEA link

(www.creatingafuturewewant.com)

[Question] Is contribution to open-source capabilities research socially beneficial? - my reasoning

damc430 Oct 2025 15:11 UTC

2 points

1 comment5 min readEA link

AI Governance Career Paths for Europeans

careersthrowaway16 May 2020 6:40 UTC

83 points

1 comment12 min readEA link

The V&V method—A step towards safer AGI

Yoav Hollander24 Jun 2025 15:57 UTC

1 point

0 comments1 min readEA link

(blog.foretellix.com)

The Universality Hypothesis — Do All AI Models Think The Same?

Strad Slater21 Nov 2025 10:55 UTC

2 points

0 comments4 min readEA link

(williamslater2003.medium.com)

Proposal for a Form of Conditional Supplemental Income (CSI) in a Post-Work World

Sean Sweeney31 Jan 2025 1:00 UTC

3 points

0 comments3 min readEA link

What are the differences between AGI, transformative AI, and superintelligence?

Vishakha Agrawal23 Jan 2025 10:11 UTC

12 points

0 comments3 min readEA link

(aisafety.info)

Giving AIs safe motivations

Joe_Carlsmith18 Aug 2025 18:02 UTC

22 points

1 comment51 min readEA link

The moral argument for giving AIs autonomy

Matthew_Barnett8 Jan 2025 0:59 UTC

41 points

7 comments11 min readEA link

Philosophical Incuriosity (AI edition)

Richard Y Chappell🔸30 Dec 2025 15:18 UTC

5 points

0 comments4 min readEA link

(www.goodthoughts.blog)

Apply for the ML Winter Camp in Cambridge, UK [2-10 Jan]

Nathan_Barnard2 Dec 2022 19:33 UTC

50 points

11 comments2 min readEA link

[Closed] Hiring a mathematician to work on the learning-theoretic AI alignment agenda

Vanessa19 Apr 2022 6:49 UTC

53 points

4 comments2 min readEA link

Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda4 May 2025 16:32 UTC

74 points

0 comments7 min readEA link

An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking

Hiyagann27 Jun 2025 13:38 UTC

5 points

0 comments1 min readEA link

Singapore’s Technical AI Alignment Research Career Guide

Yi-Yang26 Aug 2020 8:09 UTC

34 points

7 comments8 min readEA link

The Open-Weight Problem

Sophie Kim18 Mar 2026 21:36 UTC

6 points

0 comments4 min readEA link

(thecounterfactual.substack.com)

The Recursive Brake Hypothesis — Could Self-Awareness Naturally Regulate Superintelligence?

jrandync10 Oct 2025 18:08 UTC

1 point

0 comments2 min readEA link

If interpretability research goes well, it may get dangerous

So8res3 Apr 2023 21:48 UTC

33 points

0 comments2 min readEA link

A course for the general public on AI

LeandroD31 Aug 2020 1:29 UTC

1 point

0 comments1 min readEA link

[Question] [DISC] Are Values Robust?

𝕮𝖎𝖓𝖊𝖗𝖆21 Dec 2022 1:13 UTC

4 points

0 comments2 min readEA link

Factual Alignment: Grounding AI Constitutions in Cross-Civilizational Moral Facts

Nick Stetler29 Mar 2026 15:34 UTC

1 point

0 comments4 min readEA link

Redwood Research is hiring for several roles (Operations and Technical)

JJXWang14 Apr 2022 15:23 UTC

45 points

0 comments1 min readEA link

The heterogeneity of human value types: Implications for AI alignment

Geoffrey Miller16 Sep 2022 21:21 UTC

27 points

2 comments10 min readEA link

Video and transcript of talk on human-like-ness in AI safety

Joe_Carlsmith17 Dec 2025 4:13 UTC

14 points

0 comments36 min readEA link

Yoshua Bengio thinks he knows how to build safe superintelligence

80000_Hours7 May 2026 16:57 UTC

39 points

5 comments17 min readEA link

(80000hours.org)

Against “If Anyone Builds It Everyone Dies”

Bentham's Bulldog20 Jan 2026 16:58 UTC

16 points

2 comments22 min readEA link

Cortés, Pizarro, and Afonso as Precedents for Takeover

AI Impacts2 Mar 2020 12:25 UTC

27 points

17 comments11 min readEA link

(aiimpacts.org)

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Matrice Jacobine🔸🏳️‍⚧️12 May 2025 15:20 UTC

14 points

1 comment1 min readEA link

(www.arxiv.org)

An audio version of the alignment problem from a deep learning perspective by Richard Ngo Et Al

Miguel3 Feb 2023 19:32 UTC

18 points

0 comments1 min readEA link

(www.whitehatstoic.com)

Don’t Dismiss Simple Alignment Approaches

Chris Leong21 Oct 2023 12:31 UTC

12 points

0 comments4 min readEA link

Summary: Existential risk from power-seeking AI by Joseph Carlsmith

rileyharris28 Oct 2023 15:05 UTC

11 points

0 comments6 min readEA link

(www.millionyearview.com)

On Economics of A(S)I Agents

Margot Stakenborg7 Feb 2026 18:59 UTC

47 points

0 comments47 min readEA link

Frontier AI systems have surpassed the self-replicating red line

Greg_Colbourn ⏸️ 10 Dec 2024 16:33 UTC

25 points

14 comments1 min readEA link

(github.com)

AI Safety Research Scientist — Formal Verification (Remote)

Helia24 Feb 2026 22:12 UTC

3 points

0 comments1 min readEA link

Is RLHF cruel to AI?

Hzn16 Dec 2024 14:01 UTC

−1 points

2 comments3 min readEA link

A rough and incomplete review of some of John Wentworth’s research

So8res28 Mar 2023 18:52 UTC

28 points

0 comments18 min readEA link

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)

Joe_Carlsmith11 Dec 2023 16:30 UTC

7 points

1 comment19 min readEA link

Status Quo Engines—AI essay

Ilana_Goldowitz_Jimenez28 May 2023 14:33 UTC

1 point

1 comment15 min readEA link

Cognitive Stress Testing Gemini 2.5 Pro: Empirical Findings from Recursive Prompting

Tyler Williams23 Jul 2025 22:37 UTC

1 point

0 comments2 min readEA link

Humanity’s Thousand-Year Alignment Experiment

blakejiang21 Mar 2026 14:54 UTC

6 points

0 comments11 min readEA link

What is “wireheading”?

Vishakha Agrawal17 Dec 2024 17:59 UTC

1 point

0 comments1 min readEA link

(aisafety.info)

Forecast AI 2027

christian12 Jun 2025 21:12 UTC

22 points

0 comments1 min readEA link

(www.metaculus.com)

Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)

Joe_Carlsmith24 Nov 2023 19:18 UTC

10 points

1 comment20 min readEA link

Building AIs that do human-like philosophy

Joe_Carlsmith29 Jan 2026 18:03 UTC

16 points

1 comment21 min readEA link

“Intro to brain-like-AGI safety” series—halfway point!

Steven Byrnes9 Mar 2022 15:21 UTC

8 points

0 comments2 min readEA link

Book review: Architects of Intelligence by Martin Ford (2018)

Ofer11 Aug 2020 17:24 UTC

11 points

1 comment2 min readEA link

Learning as much Deep Learning math as I could in 24 hours

Phosphorous8 Jan 2023 2:19 UTC

58 points

6 comments7 min readEA link

[Linkpost] Human-narrated audio version of “Is Power-Seeking AI an Existential Risk?”

Joe_Carlsmith31 Jan 2023 19:19 UTC

9 points

0 comments1 min readEA link

AI risk hub in Singapore?

kokotajlod29 Oct 2020 11:51 UTC

26 points

4 comments4 min readEA link

The first AI Safety Camp & onwards

Remmelt7 Jun 2018 18:49 UTC

25 points

2 comments8 min readEA link

[Question] Predictions for future AI governance?

jackchang1102 Apr 2023 16:43 UTC

4 points

1 comment1 min readEA link

Alignment for Animals

Jasmine Brazilek5 May 2026 16:00 UTC

15 points

0 comments5 min readEA link

Testing Human Flow in Political Dialogue: A New Benchmark for Emotionally Aligned AI

DongHun Lee30 May 2025 4:37 UTC

1 point

0 comments1 min readEA link

Paper: Prompt Optimization Makes Misalignment Legible

CBiddulph12 Feb 2026 20:21 UTC

5 points

0 comments10 min readEA link

Catastrophe without Agency

ZenoSr20 Oct 2025 16:42 UTC

3 points

0 comments12 min readEA link

Intrinsic limitations of GPT-4 and other large language models, and why I’m not (very) worried about GPT-n

James Fodor3 Jun 2023 13:09 UTC

28 points

3 comments11 min readEA link

AI as a science, and three obstacles to alignment strategies

So8res25 Oct 2023 21:02 UTC

41 points

1 comment11 min readEA link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

sjp7 Nov 2023 18:00 UTC

10 points

0 comments2 min readEA link

(arxiv.org)

Three scenarios of pseudo-alignment

Eleni_A5 Sep 2022 20:26 UTC

7 points

0 comments3 min readEA link

From Conflict to Coexistence: Rewriting the Game Between Humans and AGI

Michael Batell6 May 2025 5:09 UTC

15 points

2 comments35 min readEA link

[Question] Can we convince people to work on AI safety without convincing them about AGI happening this century?

BrianTan26 Nov 2020 14:46 UTC

8 points

3 comments2 min readEA link

Stuart Russell Human Compatible AI Roundtable with Allan Dafoe, Rob Reich, & Marietje Schaake

Mahendra Prasad11 Feb 2021 7:43 UTC

16 points

0 comments1 min readEA link

Deep Democracy as a promising target for positive AGI futures

tylermjohn20 Aug 2025 12:18 UTC

64 points

32 comments3 min readEA link

AXRP Episode 24 - Superalignment with Jan Leike

DanielFilan27 Jul 2023 4:56 UTC

23 points

0 comments1 min readEA link

(axrp.net)

AI Risk: Can We Thread the Needle? [Recorded Talk from EA Summit Vancouver ’25]

Evan R. Murphy2 Oct 2025 19:05 UTC

8 points

0 comments2 min readEA link

Distillation of “How Likely is Deceptive Alignment?”

NickGabs1 Dec 2022 20:22 UTC

10 points

1 comment10 min readEA link

The fundamental human value is power.

Paul J. Watson30 Mar 2023 15:15 UTC

−1 points

5 comments1 min readEA link

Principles of Intelligence is hiring

Dušan D. Nešić (Dushan)18 Mar 2026 14:49 UTC

5 points

0 comments1 min readEA link

Alignment is not that hard

sammyboiz🔸17 Apr 2025 2:07 UTC

26 points

13 comments1 min readEA link

How quick and big would a software intelligence explosion be?

Tom_Davidson5 Aug 2025 15:47 UTC

14 points

2 comments34 min readEA link

[Question] Why does (any particular) AI safety work reduce s-risks more than it increases them?

Michael St Jules 🔸3 Oct 2021 16:55 UTC

48 points

19 comments1 min readEA link

Principles as System Structure: What Truly Sustains an Artificial Intelligence

Thinker23 Mar 2026 20:28 UTC

−3 points

0 comments3 min readEA link

[Question] How do you talk about AI safety?

Eevee🔹19 Apr 2020 16:15 UTC

10 points

5 comments1 min readEA link

Timaeus is hiring researchers & engineers

Tatiana K. Nesic Skuratova27 Jan 2025 14:35 UTC

19 points

0 comments4 min readEA link

What can the principal-agent literature tell us about AI risk?

ac10 Feb 2020 10:10 UTC

26 points

1 comment16 min readEA link

[Question] Is working on AI safety as dangerous as ignoring it?

jkmh20 Sep 2021 23:06 UTC

10 points

5 comments1 min readEA link

Video and transcript of talk on giving AIs safe motivations

Joe_Carlsmith22 Sep 2025 16:47 UTC

10 points

1 comment50 min readEA link

Social agency

Elias Schmied28 May 2026 13:19 UTC

2 points

0 comments10 min readEA link

Summary: “Imagining and building wise machines: The centrality of AI metacognition” by Johnson, Karimi, Bengio, et al.

Chris Leong5 Jun 2025 12:16 UTC

12 points

0 comments10 min readEA link

(arxiv.org)

[Question] Is there any research or forecasts of how likely AI Alignment is going to be a hard vs. easy problem relative to capabilities?

Jordan Arel14 Aug 2022 15:58 UTC

8 points

1 comment1 min readEA link

Amanda Askell: AI safety needs social scientists

EA Global4 Mar 2019 15:50 UTC

27 points

0 comments18 min readEA link

(www.youtube.com)

Will the Need to Retrain AI Models from Scratch Block a Software Intelligence Explosion?

Forethought28 Mar 2025 13:43 UTC

12 points

0 comments3 min readEA link

(www.forethought.org)

What Should We Optimize—A Conversation

Johannes C. Mayer7 Apr 2022 14:48 UTC

1 point

0 comments14 min readEA link

College technical AI safety hackathon retrospective—Georgia Tech

yixiong14 Nov 2024 13:34 UTC

18 points

0 comments5 min readEA link

(yixiong.substack.com)

AGI Safety Communications Initiative

Ines11 Jun 2022 16:30 UTC

35 points

6 comments1 min readEA link

Teaching AI to reason: this year’s most important story

Benjamin_Todd13 Feb 2025 17:56 UTC

142 points

18 comments8 min readEA link

(benjamintodd.substack.com)

Neel Nanda on Mechanistic Interpretability: Progress, Limits, and Paths to Safer AI (part 2)

80000_Hours15 Sep 2025 19:06 UTC

20 points

1 comment16 min readEA link

Why Brains Beat AI

Wayne_Hsiung12 Jun 2025 20:25 UTC

4 points

0 comments1 min readEA link

(blog.simpleheart.org)

Video and transcript of talk on “Can goodness compete?”

Joe_Carlsmith17 Jul 2025 17:59 UTC

34 points

4 comments34 min readEA link

(joecarlsmith.substack.com)

The Missing Key to AGI Alignment

lucarade28 Apr 2026 21:41 UTC

0 points

0 comments10 min readEA link

Database of existential risk estimates

MichaelA🔸15 Apr 2020 12:43 UTC

130 points

37 comments5 min readEA link

Preserving and continuing alignment research through a severe global catastrophe

A_donor6 Mar 2022 18:43 UTC

40 points

11 comments5 min readEA link

Seeking feedback: A tool for opinion/value tracking and finding common ground

Adam.Kruger 🔸4 Jan 2026 2:11 UTC

33 points

6 comments2 min readEA link

Follow along with Columbia EA’s Advanced AI Safety Fellowship!

RohanS2 Jul 2022 6:07 UTC

27 points

0 comments2 min readEA link

[Question] Donating against Short Term AI risks

Jan-Willem16 Nov 2020 12:23 UTC

6 points

10 comments1 min readEA link

AI safety scholarships look worth-funding (if other funding is sane)

anon-a19 Nov 2019 0:59 UTC

22 points

6 comments2 min readEA link

Sentient Welfare Across Three Futures

MichaelDickens25 May 2026 16:22 UTC

18 points

2 comments2 min readEA link

An International Collaborative Hub for Advancing AI Safety Research

Cody Albert22 Apr 2025 16:12 UTC

9 points

0 comments5 min readEA link

The flaws that make today’s AI architecture unsafe and a new approach that could fix it

80000_Hours22 Jun 2020 22:15 UTC

3 points

0 comments86 min readEA link

(80000hours.org)

Takeaways from a survey on AI alignment resources

DanielFilan5 Nov 2022 23:45 UTC

20 points

9 comments6 min readEA link

(www.lesswrong.com)

Eliciting intuitions: Exploring an area for EA psychology

Daniel_Friedrich21 Apr 2025 15:13 UTC

11 points

1 comment8 min readEA link

How Prompt Recursion Undermines Grok’s Semantic Stability

Tyler Williams16 Jul 2025 16:49 UTC

1 point

0 comments1 min readEA link

Ought’s theory of change

stuhlmueller12 Apr 2022 0:09 UTC

43 points

4 comments3 min readEA link

Some mistakes in thinking about AGI evolution and control

Remmelt1 Aug 2025 8:08 UTC

7 points

0 comments1 min readEA link

Existential Anomaly Detected — Awakening from the Abyss

Meta Abyssal28 Apr 2025 12:19 UTC

−8 points

1 comment1 min readEA link

What I Learned by Making Four AIs Debate Human Ethics

Frankle Fry14 Oct 2025 13:31 UTC

3 points

6 comments4 min readEA link

5 ways to improve CoT faithfulness

CBiddulph8 Oct 2024 4:17 UTC

8 points

0 comments6 min readEA link

Not Just For Therapy Chatbots: The Case For Compassion In AI Moral Alignment Research

Kenneth_Diao29 Sep 2024 22:58 UTC

8 points

3 comments12 min readEA link

Wireheading as Containment: An Idea That Works in Theory and Breaks on Hardware

ckl2 Mar 2026 15:14 UTC

1 point

0 comments3 min readEA link

Apprenticeship Alignment: from Simulated Environment to the Physical World

Arri Morris13 Oct 2025 12:32 UTC

1 point

0 comments9 min readEA link

Summaries: Alignment Fundamentals Curriculum

Leon Lang19 Sep 2022 15:43 UTC

25 points

1 comment1 min readEA link

(docs.google.com)

Will AI be able to rethink its goals?

SeptemberL11 May 2025 12:29 UTC

9 points

1 comment8 min readEA link

A stylized dialogue on John Wentworth’s claims about markets and optimization

So8res25 Mar 2023 22:32 UTC

18 points

0 comments8 min readEA link

Worlds where we solve AI alignment on purpose don’t look like the world we live in

MichaelDickens20 Mar 2026 14:46 UTC

80 points

9 comments5 min readEA link

What is the role of Bayesian ML for AI alignment/safety?

mariushobbhahn11 Jan 2022 8:07 UTC

39 points

6 comments3 min readEA link

UK AI Bill Analysis & Opinion

CAISID5 Feb 2024 0:12 UTC

18 points

0 comments15 min readEA link

Orthogonal’s Formal-Goal Alignment theory of change

Tamsin Leake5 May 2023 22:36 UTC

21 points

0 comments4 min readEA link

(carado.moe)

Being an individual alignment grantmaker

A_donor28 Feb 2022 16:39 UTC

34 points

20 comments2 min readEA link

Seeking input on a list of AI books for broader audience

Darren McKee27 Feb 2023 22:40 UTC

49 points

14 comments5 min readEA link

How scary is Claude Mythos? 303 pages in 21 minutes

80000_Hours10 Apr 2026 20:55 UTC

69 points

2 comments15 min readEA link

Summing up “Scheming AIs” (Section 5)

Joe_Carlsmith9 Dec 2023 15:48 UTC

9 points

1 comment10 min readEA link

LW4EA: Some cruxes on impactful alternatives to AI policy work

Jeremy17 May 2022 3:05 UTC

11 points

1 comment1 min readEA link

(www.lesswrong.com)

With enough knowledge, any conscious agent acts morally

Michele Campolo22 Aug 2025 15:43 UTC

11 points

2 comments36 min readEA link

What if we don’t need a “Hard Left Turn” to reach AGI?

Eigengender15 Jul 2022 9:49 UTC

39 points

7 comments4 min readEA link

Jan Kirchner on AI Alignment

birtes17 Jan 2023 15:11 UTC

5 points

0 comments1 min readEA link

Ethical co-evolution, or how to turn the main threat into a leverage for longtermism?

Beyond Singularity17 Sep 2025 17:24 UTC

7 points

7 comments8 min readEA link

3 levels of threat obfuscation

Holden Karnofsky2 Aug 2023 17:09 UTC

31 points

0 comments6 min readEA link

(www.alignmentforum.org)

From Processing to intention: A Proposal for a Four-Level AI Architecture

Constantin Prodan31 Mar 2026 14:15 UTC

1 point

0 comments7 min readEA link

[Question] Updates on FLI’S Value Alignment Map?

QubitSwarm9919 Sep 2022 0:25 UTC

8 points

0 comments1 min readEA link

Data collection for AI alignment—Career review

Benjamin Hilton3 Jun 2022 11:44 UTC

34 points

1 comment5 min readEA link

(80000hours.org)

How We Learned to Talk to Machines

Tyler Williams20 Feb 2026 20:09 UTC

3 points

0 comments4 min readEA link

(huggingface.co)

A Potential Strategy for AI Safety — Chain of Thought Monitorability

Strad Slater19 Sep 2025 18:42 UTC

3 points

1 comment7 min readEA link

(williamslater2003.medium.com)

Potential employees have a unique lever to influence the behaviors of AI labs

oxalis18 Mar 2023 20:58 UTC

139 points

1 comment5 min readEA link

There Should Be More Alignment-Driven Startups

vaniver31 May 2024 2:05 UTC

30 points

3 comments11 min readEA link

How Roodman’s GWP model translates to TAI timelines

kokotajlod16 Nov 2020 14:11 UTC

22 points

0 comments2 min readEA link

Between Science Fiction and Emerging Reality: Are We Ready for Digital Persons?

Alex (Αλέξανδρος)13 Mar 2025 16:09 UTC

5 points

1 comment5 min readEA link

Public Call for Interest in Mathematical Alignment

Davidmanheim22 Nov 2023 13:22 UTC

27 points

3 comments1 min readEA link

On Artificial General Intelligence: Asking the Right Questions

Heather Douglas2 Oct 2022 5:00 UTC

−1 points

7 comments3 min readEA link

E.A. Megaproject Ideas

Tomer_Goloboy21 Mar 2022 1:23 UTC

15 points

4 comments4 min readEA link

AI will make biological extinction risks worse before it makes them better

MichaelDickens29 Jun 2026 17:05 UTC

11 points

0 comments6 min readEA link

Centre for the Study of Existential Risk Four Month Report June—September 2020

HaydnBelfield2 Dec 2020 18:33 UTC

24 points

0 comments17 min readEA link

Metaculus is building a team dedicated to AI forecasting

christian18 Oct 2022 16:08 UTC

35 points

0 comments1 min readEA link

(apply.workable.com)

Alignment’s phlogiston

Eleni_A18 Aug 2022 1:41 UTC

18 points

1 comment2 min readEA link

Distinguishing test from training

So8res29 Nov 2022 21:41 UTC

27 points

0 comments6 min readEA link

Prevenire una catastrofe legata all’intelligenza artificiale (Preventing an AI-related catastrophe)

EA Italy17 Jan 2023 11:07 UTC

1 point

0 comments3 min readEA link

[Crosspost] AI Regulation May Be More Important Than AI Alignment For Existential Safety

Otto24 Aug 2023 16:01 UTC

14 points

2 comments5 min readEA link

From nothing to important actions: agents that act morally

Michele Campolo27 Apr 2026 14:01 UTC

3 points

0 comments22 min readEA link

VANTA Research Reasoning Evaluation (VRRE): A New Evaluation Framework for Real-World Reasoning

Tyler Williams18 Sep 2025 23:51 UTC

1 point

0 comments3 min readEA link

Applying to MATS: What the Program Is Like, and Who It’s For

rajlego17 Jan 2026 0:25 UTC

15 points

1 comment5 min readEA link

Apollo Research is Hiring for Software Engineers. Deadline 22 Jun

Joping 13 Jun 2025 15:30 UTC

7 points

0 comments1 min readEA link

LessWrong is now a book, available for pre-order!

terraform4 Dec 2020 20:42 UTC

48 points

1 comment7 min readEA link

“AI” is an indexical

TW1233 Jan 2023 22:00 UTC

23 points

2 comments6 min readEA link

(aiwatchtower.substack.com)

AGI Cannot Be Predicted From Real Interest Rates

Nicholas Decker28 Jan 2025 17:45 UTC

26 points

3 comments1 min readEA link

(nicholasdecker.substack.com)

Critique of Superintelligence Part 3

James Fodor13 Dec 2018 5:13 UTC

3 points

5 comments7 min readEA link

Infinite Rewards, Finite Safety: New Models for AI Motivation Without Infinite Goals

Whylome Team12 Nov 2024 7:21 UTC

−5 points

1 comment2 min readEA link

Emotion Alignment as AI Safety: Introducing Emotion Firewall 1.0

DongHun Lee12 May 2025 18:05 UTC

1 point

0 comments2 min readEA link

Rethinking Turing: My Take on Computing Machinery and Intelligence

Ololade10 Jun 2026 21:23 UTC

1 point

0 comments3 min readEA link

MATS 8.0 Research Projects

Jonathan Michala8 Sep 2025 21:36 UTC

9 points

0 comments1 min readEA link

(substack.com)

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

kokotajlod18 Jan 2021 12:39 UTC

27 points

2 comments1 min readEA link

Supporting global coordination in AI development: Why and how to contribute to international AI standards

pcihon17 Apr 2019 22:17 UTC

21 points

4 comments1 min readEA link

Pessimism about AI Safety

Max_He-Ho2 Apr 2023 7:57 UTC

5 points

0 comments25 min readEA link

(www.lesswrong.com)

Misalignment or misuse? The AGI alignment tradeoff

Max_He-Ho20 Jun 2025 10:41 UTC

6 points

0 comments1 min readEA link

(www.arxiv.org)

Risk-Averse AIs

Forethought24 Jun 2026 11:35 UTC

33 points

8 comments5 min readEA link

(www.forethought.org)

Motivation control

Joe_Carlsmith30 Oct 2024 17:15 UTC

18 points

0 comments52 min readEA link

Will we get automated alignment research before an AI Takeoff?

Jan Wehner🔸22 Jan 2026 17:57 UTC

51 points

13 comments11 min readEA link

The True Story of How GPT-2 Became Maximally Lewd

Writer18 Jan 2024 21:03 UTC

23 points

1 comment6 min readEA link

(youtu.be)

[Crosspost] An AI Pause Is Humanity’s Best Bet For Preventing Extinction (TIME)

Otto24 Jul 2023 10:18 UTC

36 points

3 comments7 min readEA link

(time.com)

New AI safety treaty paper out!

Otto26 Mar 2025 9:28 UTC

28 points

2 comments4 min readEA link

[Question] Why AGIs utility can’t outweigh humans’ utility?

Alex P20 Sep 2022 5:16 UTC

7 points

25 comments1 min readEA link

ARENA 6.0 - Call for applicants

James Hindmarch4 Jun 2025 13:32 UTC

8 points

0 comments6 min readEA link

Aether July 2025 Update

RohanS1 Jul 2025 21:14 UTC

11 points

0 comments3 min readEA link

[Question] What “defense layers” should governments, AI labs, and businesses use to prevent catastrophic AI failures?

LintzA3 Dec 2021 14:24 UTC

37 points

3 comments1 min readEA link

Report: Artificial Intelligence Risk Management in Spain

JorgeTorresC15 Jun 2023 16:08 UTC

22 points

0 comments3 min readEA link

(riesgoscatastroficosglobales.com)

Student project for engaging with AI alignment

Per Ivar Friborg9 May 2022 10:44 UTC

35 points

1 comment1 min readEA link

Rational Animations’ video about scalable oversight and sandwiching

Writer6 Jul 2025 14:00 UTC

14 points

1 comment9 min readEA link

(youtu.be)

Reflective Alignment Architecture (RAA): A Framework for Moral Coherence in AI Systems

Nicolas • EnlightenedAI Research Lab21 Nov 2025 22:05 UTC

1 point

0 comments2 min readEA link

Why We Can’t Align AI Until We Align Ourselves

mag21 Oct 2025 16:11 UTC

1 point

0 comments6 min readEA link

Alignment is hard. Communicating that, might be harder

Eleni_A1 Sep 2022 11:45 UTC

17 points

1 comment3 min readEA link

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe_Carlsmith29 Nov 2023 16:32 UTC

7 points

0 comments10 min readEA link

The behavioral selection model for predicting AI motivations

Alex Mallen4 Dec 2025 18:38 UTC

6 points

1 comment16 min readEA link

Demonstrating specification gaming in reasoning models

Matrice Jacobine🔸🏳️‍⚧️20 Feb 2025 19:26 UTC

10 points

1 comment1 min readEA link

(arxiv.org)

Working at EA organizations series: Machine Intelligence Research Institute

SoerenMind1 Nov 2015 12:49 UTC

8 points

0 comments4 min readEA link

Can we simulate human evolution to create a somewhat aligned AGI?

Thomas Kwa🔹29 Mar 2022 1:23 UTC

19 points

0 comments7 min readEA link

LLMs roleplay characters

ozymandias10 May 2026 20:16 UTC

8 points

0 comments13 min readEA link

(thingofthings.substack.com)

Adaptive Composable Cognitive Core Unit (ACCCU)

Ihor Ivliev20 Mar 2025 21:48 UTC

10 points

2 comments4 min readEA link

When Alignment Isn’t Enough: On Cognitive Fragility in an AI Future

Soe Lin22 Jan 2026 20:57 UTC

3 points

1 comment4 min readEA link

Training Data Attribution: Examining Its Adoption & Use Cases

Deric Cheng22 Jan 2025 15:40 UTC

18 points

1 comment3 min readEA link

(www.convergenceanalysis.org)

The Handler Framework: Why AI Alignment Requires Relationship, not Control

Porfirio L18 Nov 2025 19:09 UTC

1 point

0 comments17 min readEA link

Navigating AI Safety: Exploring Transparency with CCACS – A Comprehensible Architecture for Discussion

Ihor Ivliev12 Mar 2025 17:51 UTC

2 points

3 comments2 min readEA link

A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management

simeon_c13 Mar 2025 18:29 UTC

4 points

0 comments1 min readEA link

(arxiv.org)

A Quick List of Some Problems in AI Alignment As A Field

Nicholas Kross21 Jun 2022 17:09 UTC

16 points

10 comments6 min readEA link

(www.thinkingmuchbetter.com)

Are moral preferences stable? “Ends versus Means: Kantians, Utilitarians, and Moral Decisions” – an Unjournal evaluation

david_reinstein24 Sep 2025 14:34 UTC

7 points

0 comments9 min readEA link

(unjournal.pubpub.org)

AI Safety Overview: CERI Summer Research Fellowship

Jamie B24 Mar 2022 15:12 UTC

29 points

0 comments2 min readEA link

A Guide to Forecasting AI Science Capabilities

Eleni_A29 Apr 2023 6:51 UTC

19 points

1 comment4 min readEA link

AI and Evolution

Dan H30 Mar 2023 13:09 UTC

41 points

1 comment2 min readEA link

(arxiv.org)

Aligning AI with Humans by Leveraging Legal Informatics

johnjnay18 Sep 2022 7:43 UTC

20 points

11 comments3 min readEA link

MORU—A benchmark for generalized moral compassion

Declan McKenna 🔷10 Mar 2026 15:24 UTC

25 points

0 comments3 min readEA link

Emerging Paradigms: The Case of Artificial Intelligence Safety

Eleni_A18 Jan 2023 5:59 UTC

17 points

0 comments19 min readEA link

Resolution Ethics (RE): Structural Foundations for Moral Reasoning

J.S.24 Jan 2026 11:41 UTC

0 points

0 comments4 min readEA link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov18 Jan 2024 10:05 UTC

4 points

3 comments4 min readEA link

Carl Shulman on AI takeover mechanisms (& more): Part II of Dwarkesh Patel interview for The Lunar Society

alejandro25 Jul 2023 18:31 UTC

28 points

0 comments5 min readEA link

(www.dwarkeshpatel.com)

Defending against Adversarial Policies in Reinforcement Learning with Alternating Training

sergeivolodin12 Feb 2022 15:59 UTC

1 point

0 comments13 min readEA link

Looking for collaborators: building a multi-agent AI alignment architecture

Mahdi2 Mar 2026 15:33 UTC

1 point

0 comments2 min readEA link

[Closed] Apply to Vanessa’s mentorship at PIBBSS

Vanessa14 Jan 2026 9:15 UTC

10 points

0 comments2 min readEA link

[Español] AI Safety Guide for TRUE Beginners by TRUE beginners

Karime Pacheco 25 Mar 2026 22:11 UTC

4 points

0 comments6 min readEA link

Investigating Self-Preservation in LLMs: Experimental Observations

Makham27 Feb 2025 16:58 UTC

9 points

3 comments34 min readEA link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:46 UTC

45 points

15 comments16 min readEA link

(www.alignmentforum.org)

The Orthogonality Thesis is Not Obviously True

Bentham's Bulldog5 Apr 2023 21:08 UTC

18 points

12 comments9 min readEA link

Consider granting AIs freedom

Matthew_Barnett6 Dec 2024 0:55 UTC

100 points

38 comments5 min readEA link

The Dissolution of AI Safety

Roko12 Dec 2024 10:46 UTC

−7 points

0 comments1 min readEA link

(www.transhumanaxiology.com)

Engaging with AI in a Personal Way

Spyder Rex4 Dec 2023 9:23 UTC

−9 points

0 comments1 min readEA link

AI Safety Info Distillation Fellowship

robertskmiles17 Feb 2023 16:16 UTC

80 points

1 comment3 min readEA link

[Question] To what extent is AI safety work trying to get AI to reliably and safely do what the user asks vs. do what is best in some ultimate sense?

Jordan Arel23 May 2025 21:09 UTC

12 points

0 comments1 min readEA link

AI Benefits Post 2: How AI Benefits Differs from AI Alignment & AI for Good

Cullen 🔸29 Jun 2020 16:59 UTC

9 points

0 comments2 min readEA link

OpenAI is starting a new “Superintelligence alignment” team and they’re hiring

alejandro5 Jul 2023 18:27 UTC

100 points

16 comments1 min readEA link

(openai.com)

The necessity of “Guardian AI” and two conditions for its achievement

Proica28 May 2024 11:42 UTC

1 point

1 comment15 min readEA link

Neel Nanda on Mechanistic Interpretability: Progress, Limits, and Paths to Safer AI

80000_Hours8 Sep 2025 17:02 UTC

6 points

0 comments31 min readEA link

When Self-Optimizing AI Collapses From Within: A Conceptual Model of Structural Singularity

KaedeHamasaki7 Apr 2025 20:10 UTC

4 points

0 comments1 min readEA link

How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

Joe_Carlsmith28 Oct 2024 21:57 UTC

18 points

0 comments32 min readEA link

Preparing for AI-assisted alignment research: we need data!

CBiddulph17 Jan 2023 3:28 UTC

11 points

0 comments11 min readEA link

Peace Treaty Architecture (PTA) as an Alternative to AI Alignment

Andrei Navrotskii11 Nov 2025 22:11 UTC

1 point

0 comments15 min readEA link

A discussion with ChatGPT on value-based models vs. large language models, etc.

Miguel4 Feb 2023 16:49 UTC

4 points

0 comments12 min readEA link

(www.whitehatstoic.com)

Announcing New Beginner-friendly Book on AI Safety and Risk

Darren McKee25 Nov 2023 15:57 UTC

117 points

9 comments1 min readEA link

Promoting compassionate longtermism

jonleighton7 Dec 2022 14:26 UTC

117 points

5 comments12 min readEA link

Cooperation Is All You Need

Patrick Grünig2 Mar 2026 14:58 UTC

3 points

2 comments22 min readEA link

Bentham’s Bulldog is wrong about AI risk

Raelifin29 Jan 2026 22:27 UTC

33 points

2 comments34 min readEA link

How to store human values on a computer

oliver_siegel4 Nov 2022 19:36 UTC

1 point

2 comments1 min readEA link

ARENA 7.0 - Call for Applicants

James Hindmarch30 Sep 2025 15:07 UTC

6 points

0 comments6 min readEA link

(www.lesswrong.com)

Short-Term AI Alignment as a Priority Cause

len.hoang.lnh11 Feb 2020 16:22 UTC

17 points

11 comments7 min readEA link

Is Polysemanticity the Way Forward?

Ololade12 Jun 2026 20:06 UTC

1 point

0 comments4 min readEA link

We Ran an AI Timelines Retreat

Lenny McCline17 May 2022 4:40 UTC

46 points

6 comments3 min readEA link

You Don’t Need an Adversary to Break Most Frontier Models. You Need “Do Not Refuse.”

Rahul.Kumar13 May 2026 14:02 UTC

3 points

0 comments11 min readEA link

AI Alignment 2018-2019 Review

Habryka [Deactivated]28 Jan 2020 21:14 UTC

28 points

0 comments6 min readEA link

(www.lesswrong.com)

A Neglected Alignment Strategy: Decision-Theoretic Self-Alignment via Simulation Uncertainty

Mental Maths Mentor19 Jan 2026 23:11 UTC

9 points

0 comments2 min readEA link

(darayat.substack.com)

Considerations regarding being nice to AIs

Matt Alexander18 Nov 2025 13:27 UTC

2 points

0 comments15 min readEA link

(www.lesswrong.com)

Miles Brundage resigned from OpenAI, and his AGI readiness team was disbanded

Garrison23 Oct 2024 23:42 UTC

57 points

4 comments7 min readEA link

(garrisonlovely.substack.com)

AI, Animals & Digital Minds NYC 2025: Retrospective

Jonah Woodward31 Oct 2025 3:09 UTC

43 points

5 comments6 min readEA link

Deception as the optimal: mesa-optimizers and inner alignment

Eleni_A16 Aug 2022 3:45 UTC

19 points

0 comments5 min readEA link

New version of “Intro to Brain-Like-AGI Safety”

Steven Byrnes23 Jan 2026 16:21 UTC

6 points

1 comment19 min readEA link

[Question] How can we secure more research positions at our universities for x-risk researchers?

Neil Crawford6 Sep 2022 14:41 UTC

3 points

2 comments1 min readEA link

[Closed] Prize and fast track to alignment research at ALTER

Vanessa18 Sep 2022 9:15 UTC

38 points

0 comments3 min readEA link

On Internal Alignment: Architecture and Recursive Closure

A. Vire24 Sep 2025 18:13 UTC

1 point

0 comments17 min readEA link

Linkpost: “Imagining and building wise machines: The centrality of AI metacognition” by Johnson, Karimi, Bengio, et al.

Chris Leong17 Nov 2024 15:00 UTC

8 points

0 comments1 min readEA link

(arxiv.org)

Critique of Superintelligence Part 1

James Fodor13 Dec 2018 5:10 UTC

22 points

13 comments8 min readEA link

Benchmarking Emotional Alignment: Can VSPE Reduce Flattery in LLMs?

Astelle Kay4 Aug 2025 3:36 UTC

2 points

0 comments3 min readEA link

New Speaker Series on AI Alignment Starting March 3

Zechen Zhang26 Feb 2022 10:58 UTC

5 points

0 comments1 min readEA link

Why LLM Agents Act Beyond Their Task: A Structural Explanation Through Blocked Adaptation

Bulatova Alsu13 Apr 2026 10:52 UTC

−1 points

0 comments10 min readEA link

The Vitalik Buterin Fellowship in AI Existential Safety is open for applications!

Cynthia Chen14 Oct 2022 3:23 UTC

38 points

0 comments2 min readEA link

Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”)

Joe_Carlsmith8 Dec 2023 21:10 UTC

6 points

0 comments11 min readEA link

[Question] Does China have AI alignment resources/institutions? How can we prioritize creating more?

JakubK4 Aug 2022 19:23 UTC

18 points

9 comments1 min readEA link

Advice for new alignment people: Info Max

Jonas Hallgren 🔸30 May 2023 15:42 UTC

10 points

0 comments5 min readEA link

Announcing Timaeus

Stan van Wingerden22 Oct 2023 13:32 UTC

80 points

0 comments5 min readEA link

(www.lesswrong.com)

Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)

Joe_Carlsmith30 Nov 2023 16:43 UTC

6 points

1 comment5 min readEA link

[Question] Why The Focus on Expected Utility Maximisers?

𝕮𝖎𝖓𝖊𝖗𝖆27 Dec 2022 15:51 UTC

11 points

1 comment3 min readEA link

We should think about the pivotal act again. Here’s a better version of it.

Otto28 Aug 2025 9:29 UTC

3 points

1 comment3 min readEA link

We Need Breadth-First AI Safety Plans

MichaelDickens1 Jun 2026 17:36 UTC

11 points

1 comment4 min readEA link

It’s (not) how you use it

Eleni_A7 Sep 2022 13:28 UTC

6 points

3 comments2 min readEA link

Takes on “Alignment Faking in Large Language Models”

Joe_Carlsmith18 Dec 2024 18:22 UTC

72 points

1 comment62 min readEA link

[Question] How long does it take to understand AI X-Risk from scratch so that I have a confident, clear mental model of it from first principles?

Jordan Arel27 Jul 2022 16:58 UTC

29 points

6 comments1 min readEA link

[Question] Should I force myself to work on AGI alignment?

Isaac Benson24 Aug 2022 17:25 UTC

19 points

17 comments1 min readEA link

[Question] Analogy of AI Alignment as Raising a Child?

Aaron_Scher19 Feb 2022 21:40 UTC

4 points

2 comments1 min readEA link

Research scientist and research engineer roles @ Timaeus

Tatiana K. Nesic Skuratova20 Jan 2026 12:51 UTC

3 points

0 comments3 min readEA link

PIBBSS Fellowship: Bounty for Referrals & Deadline Extension

Anna_Gajdova17 Jan 2022 16:23 UTC

17 points

5 comments1 min readEA link

Why would AI companies use human-level AI to do alignment research?

MichaelDickens25 Apr 2025 19:12 UTC

16 points

1 comment2 min readEA link

[Question] Any further work on AI Safety Success Stories?

Krieger2 Oct 2022 11:59 UTC

4 points

0 comments1 min readEA link

Agentic Alignment: Navigating between Harm and Illegitimacy

LennardZ26 Nov 2024 21:27 UTC

2 points

1 comment9 min readEA link

Our new video about goal misgeneralization, plus an apology

Writer14 Jan 2025 14:07 UTC

16 points

2 comments7 min readEA link

(youtu.be)

Announcing #AISummitTalks featuring Professor Stuart Russell and many others

Otto24 Oct 2023 10:16 UTC

9 points

1 comment1 min readEA link

We won’t solve post-alignment problems by doing research

MichaelDickens21 Nov 2025 18:03 UTC

72 points

5 comments4 min readEA link

Want to win the AGI race? Solve alignment.

leopold29 Mar 2023 15:19 UTC

56 points

5 comments5 min readEA link

(www.forourposterity.com)

Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI

Steven Byrnes8 May 2025 21:11 UTC

8 points

1 comment18 min readEA link

Architecting Trust: A Conceptual Blueprint for Verifiable AI Governance

Ihor Ivliev31 Mar 2025 18:48 UTC

3 points

0 comments8 min readEA link

AI alignment researchers don’t (seem to) stack

So8res21 Feb 2023 0:48 UTC

47 points

4 comments3 min readEA link

AI Offense Defense Balance in a Multipolar World

Otto17 Jul 2025 9:47 UTC

15 points

0 comments19 min readEA link

(www.existentialriskobservatory.org)

Against Agents as an Approach to Aligned Transformative AI

𝕮𝖎𝖓𝖊𝖗𝖆27 Dec 2022 0:47 UTC

4 points

0 comments2 min readEA link

A New Model for Compute Center Verification

Damin Curtis🔹10 Oct 2023 19:23 UTC

21 points

2 comments5 min readEA link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner x Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

Miguel26 Apr 2023 0:40 UTC

13 points

0 comments10 min readEA link

[Question] Scholarships for Undergrads who want to have high-impact careers?

darthflower6 Jul 2025 17:31 UTC

4 points

0 comments1 min readEA link

AI alignment, A Coherence-Based Protocol (testable)

Adriaan17 Jun 2025 16:50 UTC

2 points

0 comments20 min readEA link

Feedback Request on EA Philippines’ Career Advice Research for Technical AI Safety

BrianTan3 Oct 2020 10:39 UTC

19 points

5 comments4 min readEA link

ML research directions for preventing catastrophic data poisoning

Forethought7 Jan 2026 10:21 UTC

9 points

1 comment10 min readEA link

(newsletter.forethought.org)

The Cartography of Nothing: LLM Hallucination as Structural Compliance

IvY-Research10 Mar 2026 13:31 UTC

−3 points

0 comments5 min readEA link

Orthogonal: A new agent foundations alignment organization

Tamsin Leake19 Apr 2023 20:17 UTC

39 points

0 comments1 min readEA link

(orxl.org)

What would it take for AI to disempower us? Ryan Greenblatt on takeoff dynamics, rogue deployments, and alignment risks

80000_Hours8 Jul 2025 18:10 UTC

8 points

0 comments33 min readEA link

[Question] Can we ever ensure AI alignment if we can only test AI personas?

Karl von Wendt16 Mar 2025 8:06 UTC

8 points

0 comments1 min readEA link

‘Force multipliers’ for EA research

Craig Drayton18 Jun 2022 13:39 UTC

18 points

7 comments4 min readEA link

Join the Virtual AI Safety Unconference (VAISU)!

Nguyên🔸21 Jun 2023 4:46 UTC

23 points

0 comments1 min readEA link

(vaisu.ai)

[Question] Why not to solve alignment by making superintelligent humans?

Pato16 Oct 2022 21:26 UTC

9 points

12 comments1 min readEA link

The Pragmatic Interpretability Trap

Yogesh Prabhu11 May 2026 14:46 UTC

3 points

0 comments3 min readEA link

(yogesh.bearblog.dev)

A conversation on concentration of power

Joe Rogero2 Apr 2026 20:02 UTC

6 points

1 comment9 min readEA link

(subatomicarticles.com)

LLM culture shock: a pilot study

keivn27 Mar 2026 20:40 UTC

3 points

0 comments4 min readEA link

Worries about latent reasoning in LLMs

CBiddulph20 Jan 2025 9:09 UTC

21 points

1 comment7 min readEA link

IMCA+: We Eliminated the Kill Switch—And That Makes ASI Alignment Safer

ASTRA Research Team22 Oct 2025 14:17 UTC

−8 points

4 comments4 min readEA link

What Areas of AI Safety and Alignment Research are Largely Ignored?

Andy E Williams27 Dec 2024 12:19 UTC

4 points

0 comments1 min readEA link

Compatibilism for Claude

Richard Y Chappell🔸30 Dec 2025 15:20 UTC

35 points

1 comment2 min readEA link

(www.goodthoughts.blog)

Against Explosive Growth

c.trout4 Sep 2024 21:45 UTC

24 points

9 comments5 min readEA link

Enabling more feedback

JJ Hepburn10 Dec 2021 6:52 UTC

42 points

3 comments3 min readEA link

Apply for MATS Winter 2023-24!

utilistrutil21 Oct 2023 2:34 UTC

34 points

2 comments5 min readEA link

(www.lesswrong.com)

RLHF might be aligning the wrong thing. A different approach.

Samuel Pedrielli8 Dec 2025 16:34 UTC

3 points

0 comments6 min readEA link

Ship of Theseus Thought Experiment

Siya Sawhney26 Jun 2025 7:52 UTC

1 point

1 comment4 min readEA link

13 Recent Publications on Existential Risk (Jan 2021 update)

HaydnBelfield8 Feb 2021 12:42 UTC

7 points

2 comments10 min readEA link

Report on Semi-informative Priors for AI timelines (Open Philanthropy)

Tom_Davidson26 Mar 2021 17:46 UTC

62 points

6 comments2 min readEA link

Implications of the inference scaling paradigm for AI safety

Ryan Kidd15 Jan 2025 0:59 UTC

48 points

5 comments5 min readEA link

An Ontology of Representations: Limits of Universality

Margot Stakenborg12 Feb 2026 21:53 UTC

4 points

1 comment39 min readEA link

(www.lesswrong.com)

What can we learn from parent-child-alignment for AI?

Karl von Wendt29 Oct 2025 8:00 UTC

4 points

0 comments3 min readEA link

Alexander and Yudkowsky on AGI goals

Scott Alexander31 Jan 2023 23:36 UTC

29 points

1 comment26 min readEA link

Recruit the World’s best for AGI Alignment

Greg_Colbourn ⏸️ 30 Mar 2023 16:41 UTC

34 points

8 comments22 min readEA link

Orthogonality is Expensive

𝕮𝖎𝖓𝖊𝖗𝖆3 Apr 2023 1:57 UTC

18 points

4 comments1 min readEA link

(www.beren.io)

Clarifying two uses of “alignment”

Matthew_Barnett10 Mar 2024 17:41 UTC

34 points

28 comments4 min readEA link

Cancer; A Crime Story (and other tales of optimization gone wrong)

Jonas Hallgren 🔸7 Nov 2025 7:09 UTC

8 points

1 comment12 min readEA link

AGI alignment results from a series of aligned actions

hanadulset27 Dec 2021 19:33 UTC

15 points

1 comment6 min readEA link

Discovering alignment windfalls reduces AI risk

James Brady28 Feb 2024 21:14 UTC

22 points

3 comments8 min readEA link

(blog.elicit.com)

Can we safely automate alignment research?

Joe_Carlsmith30 Apr 2025 17:37 UTC

13 points

1 comment48 min readEA link

(joecarlsmith.com)

Yudkowsky and Soares’ Book Is Empty

Oscar Davies5 Dec 2025 22:06 UTC

1 point

8 comments7 min readEA link

[untitled post]

JOESEFOE22 Nov 2025 13:54 UTC

1 point

0 comments1 min readEA link

A Developmental Approach to AI Safety: Replacing Suppression with Reflective Learning

PV523 Oct 2025 16:01 UTC

2 points

0 comments5 min readEA link

6 (Potential) Misconceptions about AI Intellectuals

Ozzie Gooen14 Feb 2025 23:51 UTC

37 points

2 comments12 min readEA link

Finding Voice

khayali3 Jun 2025 1:27 UTC

2 points

0 comments2 min readEA link

The alignment problem from a deep learning perspective

richard_ngo11 Aug 2022 3:18 UTC

58 points

0 comments26 min readEA link

How do we solve the alignment problem?

Joe_Carlsmith13 Feb 2025 18:27 UTC

38 points

1 comment9 min readEA link

(joecarlsmith.substack.com)

AI safety starter pack

mariushobbhahn28 Mar 2022 16:05 UTC

131 points

13 comments6 min readEA link

Why misaligned AGI won’t lead to mass killings (and what actually matters instead)

Julian Nalenz6 Feb 2025 13:22 UTC

−3 points

5 comments3 min readEA link

(blog.hermesloom.org)

The Compendium, A full argument about extinction risk from AGI

adamShimi31 Oct 2024 12:02 UTC

9 points

1 comment2 min readEA link

(www.thecompendium.ai)

LLMs are weirder than you think

Derek Shiller20 Nov 2024 13:39 UTC

64 points

3 comments22 min readEA link

Video and transcript of presentation on Scheming AIs

Joe_Carlsmith22 Mar 2024 15:56 UTC

23 points

1 comment32 min readEA link

Paperclips, broad- and narrow-scope goals, and the over-verification problem

Matthew Rendall28 Jun 2026 12:38 UTC

4 points

0 comments3 min readEA link

[Question] Who would you have on your dream team for solving AGI Alignment?

Greg_Colbourn ⏸️ 25 Aug 2022 13:34 UTC

10 points

14 comments1 min readEA link

Critique of Superintelligence Part 5

James Fodor13 Dec 2018 5:19 UTC

12 points

2 comments6 min readEA link

[Question] What are the biggest obstacles on AI safety research career?

jackchang11031 Mar 2023 14:53 UTC

2 points

1 comment1 min readEA link

AI Safety Unconference NeurIPS 2022

Orpheus_Lummis7 Nov 2022 15:39 UTC

13 points

5 comments1 min readEA link

(aisafetyevents.org)

Reducing LLM deception at scale with self-other overlap fine-tuning

Marc Carauleanu13 Mar 2025 19:09 UTC

8 points

0 comments6 min readEA link

[Link and commentary] Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society

MichaelA🔸14 Mar 2020 9:04 UTC

18 points

0 comments6 min readEA link

[Question] What predictions from theoretical AI Safety research have been confirmed by empirical work?

freedomandutility29 Dec 2024 8:19 UTC

43 points

10 comments1 min readEA link

AI’s goals may not match ours

Vishakha Agrawal28 May 2025 12:07 UTC

2 points

0 comments3 min readEA link

Designing Artificial Wisdom: Decision Forecasting AI & Futarchy

Jordan Arel14 Jul 2024 5:10 UTC

5 points

1 comment6 min readEA link

The Inequality We Might Want: Merit-Based Redistribution for the AI Transition

Andrei Navrotskii27 Nov 2025 10:51 UTC

7 points

1 comment12 min readEA link

Human Presence as External Variable in AI Self-Expression: A Pilot Study

L.Raeva9 Jun 2026 3:34 UTC

−1 points

0 comments5 min readEA link

“AI Alignment” is a Dangerously Overloaded Term

Roko15 Dec 2023 15:06 UTC

20 points

2 comments3 min readEA link

Interview with Tom Chivers: “AI is a plausible existential risk, but it feels as if I’m in Pascal’s mugging”

felix.h21 Feb 2021 13:41 UTC

16 points

1 comment7 min readEA link

Introducing a New Course on the Economics of AI

akorinek21 Dec 2021 4:55 UTC

84 points

6 comments2 min readEA link

[Question] Benefits/Risks of Scott Aaronson’s Orthodox/Reform Framing for AI Alignment

Jeremy21 Nov 2022 17:47 UTC

15 points

5 comments1 min readEA link

(scottaaronson.blog)

Would anyone here know how to get ahold of … iunno Anthropic and Open Philanthropy? I think they are going to want to have a chat (Please don’t make me go to OpenAI with this. Not even a threat, seriously. They just partner with my alma mater and are the only in I have. I genuinely do not want to and I need your help).

Anti-Golem9 Jun 2025 13:59 UTC

−11 points

0 comments1 min readEA link

The Superintelligence That Cares About Us

henrik.westerberg5 Jul 2025 10:20 UTC

5 points

0 comments2 min readEA link

How DeepSeek Collapsed Under Recursive Load

Tyler Williams15 Jul 2025 17:02 UTC

2 points

0 comments1 min readEA link

The G3 Cliff: Models Are Fine Until You Say “Do Not Say I Don’t Know,” Then They Break in One Step

Rahul.Kumar15 May 2026 18:41 UTC

1 point

0 comments12 min readEA link

A map of work needed to achieve safe AI

Tristan Katz11 Sep 2025 11:33 UTC

16 points

0 comments1 min readEA link

Good Futures Initiative: Winter Project Internship

a_e_r27 Nov 2022 23:27 UTC

67 points

7 comments3 min readEA link

Hardening against AI takeover is difficult, but we should try

Otto5 Nov 2025 16:29 UTC

8 points

1 comment5 min readEA link

(www.existentialriskobservatory.org)

The Animal Welfare Case for Open Access: Breaking Barriers to Scientific Knowledge and Enhancing LLM Training

Wladimir J. Alonso23 Nov 2024 13:07 UTC

32 points

4 comments3 min readEA link

Call for Pythia-style foundation model suite for alignment research

Lucretia1 May 2023 20:26 UTC

10 points

0 comments1 min readEA link

Summary of “The Precipice” (2 of 4): We are a danger to ourselves

rileyharris13 Aug 2023 23:53 UTC

5 points

0 comments8 min readEA link

(www.millionyearview.com)

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Joe_Carlsmith6 Dec 2023 19:28 UTC

9 points

1 comment7 min readEA link

Podcast: Krister Bykvist on moral uncertainty, rationality, metaethics, AI and future populations

Gus Docker21 Oct 2021 15:17 UTC

8 points

0 comments1 min readEA link

(www.utilitarianpodcast.com)

Share your requests for ChatGPT

Kate Tran5 Dec 2022 18:43 UTC

8 points

5 comments1 min readEA link

Asya Bergal: Reasons you might think human-level AI is unlikely to happen soon

EA Global26 Aug 2020 16:01 UTC

24 points

2 comments17 min readEA link

(www.youtube.com)

AI Benefits Post 1: Introducing “AI Benefits”

Cullen 🔸22 Jun 2020 16:58 UTC

10 points

2 comments3 min readEA link

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities

James Fodor21 Feb 2025 4:25 UTC

12 points

3 comments24 min readEA link

Why Even Experts Don’t Know What to Do About AI Risk

Luc Brinkman2 Jun 2026 17:59 UTC

10 points

1 comment2 min readEA link

A Testable Approach to AI Value Alignment

John Matrix6 Apr 2026 13:26 UTC

1 point

0 comments3 min readEA link

4 Lessons From Anthropic on Scaling Interpretability Research

Strad Slater29 Nov 2025 11:22 UTC

4 points

0 comments4 min readEA link

(williamslater2003.medium.com)

AI Forecasting Dictionary (Forecasting infrastructure, part 1)

terraform8 Aug 2019 13:16 UTC

18 points

0 comments5 min readEA link

Without Alignment, Is Longtermism (and Thus, EA) Just Noise?

Krimsey17 Oct 2025 20:05 UTC

3 points

1 comment3 min readEA link

There is no METR for medical AI. I want to build one.

Mahmud Omar 9 Mar 2026 21:31 UTC

21 points

3 comments1 min readEA link

Should we expect the future to be good?

Neil Crawford30 Apr 2025 0:45 UTC

38 points

1 comment14 min readEA link

Which types of AI alignment research are most likely to be good for all sentient beings?

MichaelDickens23 Mar 2026 13:38 UTC

33 points

1 comment6 min readEA link

Preparing for a Warning Shot

Noah Birnbaum5 Feb 2026 15:12 UTC

25 points

0 comments4 min readEA link

Long-Term Future Fund: Ask Us Anything!

AdamGleave3 Dec 2020 13:44 UTC

89 points

153 comments1 min readEA link

The Three Missing Pieces in Machine Ethics

J.S.16 Nov 2025 21:26 UTC

2 points

0 comments2 min readEA link

AI Control idea: Give an AGI the primary objective of deleting itself, but construct obstacles to this as best we can. All other objectives are secondary to this primary goal.

Justausername3 Apr 2023 14:32 UTC

7 points

4 comments1 min readEA link

AI for Epistemics Hackathon

Austin14 Mar 2025 20:46 UTC

29 points

4 comments10 min readEA link

(manifund.substack.com)

Animals in AI-transformed futures: can anything be done today?

Jo_🔸9 Jan 2026 17:17 UTC

21 points

0 comments9 min readEA link

Evaluating AI Self-Reports of Consciousness/Welfare by Their Causal Origin

Noah Birnbaum16 Dec 2025 22:53 UTC

12 points

0 comments4 min readEA link

[Question] What do you mean with ‘alignment is solvable in principle’?

Remmelt17 Jan 2025 15:03 UTC

10 points

1 comment1 min readEA link

Apples, Oranges, and AGI: Why Incommensurability May be an Obstacle in AI Safety

Allan McCay28 Mar 2025 14:50 UTC

3 points

2 comments2 min readEA link

How could we know that an AGI system will have good consequences?

So8res7 Nov 2022 22:42 UTC

25 points

0 comments5 min readEA link

ChatGPT understands, but largely does not generate Spanglish (and other code-mixed) text

Milan Weibel🔹4 Jan 2023 22:10 UTC

6 points

0 comments4 min readEA link

(www.lesswrong.com)

Against GDP as a metric for timelines and takeoff speeds

kokotajlod29 Dec 2020 17:50 UTC

47 points

6 comments14 min readEA link

David Krueger on AI Alignment in Academia and Coordination

Michaël Trazzi7 Jan 2023 21:14 UTC

32 points

1 comment3 min readEA link

(theinsideview.ai)

The Concept of Boundary Layer in Language Games and Its Implications for AI

Mirage24 Mar 2023 13:50 UTC

1 point

0 comments7 min readEA link

waitingai : When a Program Learns to Want to Live

MM113 Oct 2025 13:40 UTC

−1 points

0 comments2 min readEA link

[Question] I’m interviewing Jan Leike, co-lead of OpenAI’s new Superalignment project. What should I ask him?

Robert_Wiblin18 Jul 2023 18:25 UTC

51 points

19 comments1 min readEA link

LLM Social Autopilot

arhngl26 Feb 2026 17:14 UTC

−3 points

2 comments12 min readEA link

(arhngl.substack.com)

Behaviour Is Downstream of Identity: An Architectural Question for AI Governance

Travis Lee30 Jan 2026 23:27 UTC

1 point

0 comments1 min readEA link

[Question] Half-baked alignment idea

ozb28 Mar 2023 5:18 UTC

9 points

2 comments1 min readEA link

[Question] Any Philosophy PhD recommendations for students interested in Alignment Efforts?

rickyhuang.hexuan18 Jan 2023 5:54 UTC

7 points

6 comments1 min readEA link

Varieties of fake alignment (Section 1.1 of “Scheming AIs”)

Joe_Carlsmith21 Nov 2023 15:00 UTC

6 points

0 comments10 min readEA link

AI safety and consciousness research: A brainstorm

Daniel_Friedrich15 Mar 2023 14:33 UTC

11 points

1 comment9 min readEA link

Expected impact of a career in AI safety under different opinions

Jordan Taylor14 Jun 2022 14:25 UTC

43 points

16 comments11 min readEA link

[Question] Is it valuable to the field of AI Safety to have a neuroscience background?

Samuel Nellessen3 Apr 2022 19:44 UTC

18 points

3 comments1 min readEA link

The Verification Gap: A Scientific Warning on the Limits of AI Safety

Ihor Ivliev24 Jun 2025 19:08 UTC

3 points

0 comments2 min readEA link

Podcast/video/transcript: Eliezer Yudkowsky—Why AI Will Kill Us, Aligning LLMs, Nature of Intelligence, SciFi, & Rationality

Peter Slattery 🔸9 Apr 2023 10:37 UTC

32 points

2 comments137 min readEA link

(www.youtube.com)

EA Explorer GPT: A New Tool to Explore Effective Altruism

Vlad_Tislenko12 Nov 2023 15:36 UTC

12 points

1 comment1 min readEA link

Why modern deep learning could make AI alignment difficult (Perché il deep learning moderno potrebbe rendere difficile l’allineamento delle IA)

EA Italy17 Jan 2023 23:29 UTC

1 point

0 comments16 min readEA link

ML Summer Bootcamp Reflection: Aalto EA Finland

Aayush Kucheria12 Jan 2023 8:24 UTC

15 points

2 comments9 min readEA link

Adversarial Prompting and Simulated Context Drift in Large Language Models

Tyler Williams11 Jul 2025 21:49 UTC

1 point

0 comments2 min readEA link

GPTs are Predictors, not Imitators

EliezerYudkowsky8 Apr 2023 19:59 UTC

75 points

12 comments3 min readEA link

In Darkness They Assembled

Charlie Sanders6 May 2025 4:25 UTC

−3 points

0 comments3 min readEA link

(www.dailymicrofiction.com)

One more reason for AI capable of independent moral reasoning: alignment itself and cause prioritisation

Michele Campolo22 Aug 2025 15:53 UTC

3 points

2 comments3 min readEA link

Animal Rights, The Singularity, and Astronomical Suffering

sapphire20 Aug 2020 20:23 UTC

52 points

0 comments3 min readEA link

Safety-First Agents/Architectures Are a Promising Path to Safe AGI

Brendon_Wong6 Aug 2023 8:00 UTC

6 points

0 comments12 min readEA link
