I’m trying to understand whether recent efforts to give AI Safety research firmer empirical grounding have produced evidence that claims from theoretical AI Safety work are correct.
This could make me update in favour of taking AI Safety concerns more seriously.
I have previously been skeptical of AI Safety arguments because many of their claims rested on theoretical reasoning rather than empirical evidence.
That people will choose to let the AI out of the box.
That mindspace is large and AIs are really weird.
What specific confirmatory evidence are you thinking of?
Conversations people have with un-RLHF’d models.
Developing “contextual awareness” does not require some special grounding insight; training systems to be general-purpose problem solvers naturally causes them to optimize themselves and their environment, become aware of their context, etc. Back in 2020, 2021, and 2022, this was one of the recurring disagreements between me and many ML people.
That sufficiently intelligent AIs will not ‘automatically’ be moral (e.g. the behaviour of un-RLHF’d models).
AI systems modeling their own training process is a pretty big deal for predicting what AIs will end up caring about, and for how well you can control them (cf. the latest Anthropic paper).
For most cognitive tasks, there does not seem to be any particularly fundamental threshold at human-level performance (the jury is still out on this one in many ways, but we are seeing more evidence for it on an ongoing basis as systems reach superhuman performance on more and more measures).
That it is possible to make smarter-than-human AIs, and that this is The Issue.
Not predictions as such, but lots of current work on AI safety and steering is based pretty directly on paradigms from Yudkowsky and Christiano: from Anthropic’s Constitutional AI to ARIA’s Safeguarded AI program. There is also OpenAI’s Superalignment research, which was attempting to build AI that could solve agent foundations; that is, to explicitly do the work that theoretical AI safety research identified. (I’m unclear whether the last is still ongoing, given that they managed to alienate most of the people involved.)