[Question] What is an example of recent, tangible progress in AI safety research?

Aaron Gertler 🔸 Jun 14, 2021, 5:29 AM

35 points

Cross-posting a good question from Reddit. Answer there, here, or in both places; I’ll make sure the Reddit author knows about this post.

Eric Herboso’s answer on Reddit (the only one so far) includes these examples:

Scott Garrabrant on Finite Factored Sets (May)
Paul Christiano on his Research Methodology (March)
Rob Miles on Misaligned Mesa-Optimisers (part 1 in Feb, part 2 in May; both describe a paper from 2019)

AI alignment AI safety

CarlShulman Jun 15, 2021, 4:11 AM
12 points

Focusing on empirical results:

Learning to summarize from human feedback was good, for several reasons.

I liked the recent paper empirically demonstrating objective robustness failures hypothesized in earlier theoretical work on inner alignment.
- Mark Xu Jun 15, 2021, 4:50 AM
  4 points
  
  nit: link on “reasons” was pasted twice. For others it’s https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models
  
  Also hadn’t seen that paper. Thanks!
technicalities Jun 16, 2021, 6:51 AM
5 points

If we take “tangible” to mean executable:
- A primitive prototype and a framework for safety via debate (2018-19). Bit quiet since.
- Carey’s 2019 proof of concept / extension of quantilizers
- Stiennon et al. (2020) is an extremely encouraging example of a large negative “alignment tax” (making the model safer also made it work better)
But as Kurt Lewin once said, “there’s nothing so practical as a good theory”. In particular, theory scales automatically, and conceptual work can stop us from wasting effort on the wrong things.
- CAIS (2019) pivots away from the classic agentic model, maybe for the better
- The search for mesa-optimisers (2019) is a step forward from previous muddled thoughts on optimisation, and it makes predictions we can test soon.
- The Armstrong/Shah discussion of value learning changed my research direction for the better.
Also, Everitt et al. (2019) is both: a theoretical advance with good software.
- technicalities Jun 16, 2021, 12:48 PM
  5 points
  
  Not recent-recent, but I also really like Carey’s 2017 work on CIRL. Picks a small, well-defined problem and hammers it flush into the ground. “When exactly does this toy system go bad?”
