Stiennon et al (2020) is an extremely encouraging example of a large negative “alignment tax” (making it safer also made it work better)
But as Kurt Lewin once said, “there’s nothing so practical as a good theory”. In particular, theory scales automatically, and conceptual work can stop us from wasting effort on the wrong things.
CAIS (2019) pivots away from the classic agentic model, maybe for the better
The search for mesa-optimisers (2019) is a step forward from previously muddled thinking about optimisation, and it makes predictions we should be able to test soon.
Not recent-recent, but I also really like Carey’s 2017 work on CIRL. Picks a small, well-defined problem and hammers it flush into the ground. “When exactly does this toy system go bad?”
If we take “tangible” to mean executable:
A primitive prototype and a framework for safety via debate (2018-9). Bit quiet since.
Carey’s 2019 proof of concept / extension of quantilizers
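Quantilizers are concrete enough to sketch. The core idea (from Taylor’s original proposal, which Carey’s work builds on): instead of maximising a possibly mis-specified utility function, sample from the top q-quantile of a trusted base distribution over actions. A minimal sketch, assuming a uniform base distribution; the names here are illustrative, not from Carey’s code:

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Toy quantilizer: sample uniformly from the top q-fraction of
    `actions` under `utility`, rather than taking the argmax.

    The base distribution here is uniform over `actions`; the full
    scheme allows any trusted base distribution.
    """
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(q * len(ranked)))  # size of the top-q slice
    return rng.choice(ranked[:k])

# A pure maximiser always picks 99, exploiting any error in the utility
# function at the extreme; a 10%-quantilizer spreads over the top ten.
action = quantilize(list(range(100)), utility=lambda a: a, q=0.1)
```

The point of the design is robustness: if the utility function is wrong in a way that makes a few weird actions look great, the quantilizer only pays a bounded cost relative to the base distribution, where the maximiser pays an unbounded one.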
The Armstrong/Shah discussion of value learning changed my research direction for the better.
Also Everitt et al (2019) is both: a theoretical advance with good software.