I’m trying to understand whether recent efforts to give AI Safety research firmer empirical grounding have produced evidence that claims from theoretical AI Safety work are correct.
This could make me update in favour of taking AI Safety concerns more seriously.
I have previously been skeptical of AI Safety arguments because many of their claims rested on theoretical reasoning rather than empirical evidence.
That people will choose to let the AI out of the box.
That mindspace is large and AIs are really weird.
What specific confirmatory evidence are you thinking of?
Conversations people have with un-RLHF’d models.
Developing “contextual awareness” does not require some special grounding insight; training systems to be general-purpose problem solvers naturally causes them to optimize themselves and their environment, become aware of their context, etc. Back in 2020, 2021, and 2022, this was one of the recurring disagreements between me and many ML people.
That sufficiently intelligent AIs will not ‘automatically’ be moral (e.g. the behaviour of un-RLHF’d models).
AI systems modeling their own training process is a pretty big deal for predicting what AIs will end up caring about, and for how well you can control them (cf. the latest Anthropic paper).
For most cognitive tasks, there does not seem to be any particularly fundamental threshold at human-level performance (the jury is still out on this one in many ways, but we are seeing more evidence for it on an ongoing basis as systems reach superhuman performance on more and more measures).
That it is possible to make smarter-than-human AIs, and that this is The Issue.
Not predictions as such, but lots of current work on AI safety and steering is based pretty directly on paradigms from Yudkowsky and Christiano: from Anthropic’s Constitutional AI to ARIA’s Safeguarded AI program. There is also OpenAI’s Superalignment research, which was attempting to build AI that could solve agent foundations; that is, to explicitly do the work that theoretical AI safety research identified. (I’m unclear whether the last is still ongoing, given that they managed to alienate most of the people involved.)