Stiennon et al (2020) is an extremely encouraging example of a large negative “alignment tax” (making it safer also made it work better)
But as Kurt Lewin once said, “there’s nothing so practical as a good theory”. In particular, theory scales automatically, and conceptual work can stop us from wasting effort on the wrong things.
CAIS (2019) pivots away from the classic agentic model, maybe for the better
The search for mesa-optimisers (2019) is a step forward from previously muddled thinking about optimisation, and it makes predictions we should be able to test soon.
Not recent-recent, but I also really like Carey’s 2017 work on CIRL. Picks a small, well-defined problem and hammers it flush into the ground. “When exactly does this toy system go bad?”
If we take “tangible” to mean executable:
A primitive prototype and a framework for safety via debate (2018-9). Bit quiet since.
Carey’s 2019 proof of concept / extension of quantilizers
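Quantilizers are concrete enough to sketch. The core idea (from Taylor’s original proposal, which Carey’s work builds on): instead of maximising a possibly mis-specified utility function, sample from the top q-quantile of a trusted base distribution over actions. A minimal sketch, assuming a uniform base distribution; the names here are illustrative, not from Carey’s code:

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Toy quantilizer: sample uniformly from the top q-fraction of
    `actions` under `utility`, rather than taking the argmax.

    The base distribution here is uniform over `actions`; the full
    scheme allows any trusted base distribution.
    """
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(q * len(ranked)))  # size of the top-q slice
    return rng.choice(ranked[:k])

# A pure maximiser always picks 99, exploiting any error in the utility
# function at the extreme; a 10%-quantilizer spreads over the top ten.
action = quantilize(list(range(100)), utility=lambda a: a, q=0.1)
```

The point of the design is robustness: if the utility function is wrong in a way that makes a few weird actions look great, the quantilizer only pays a bounded cost relative to the base distribution, where the maximiser pays an unbounded one.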
The Armstrong/Shah discussion of value learning changed my research direction for the better.
Also Everitt et al (2019) is both: a theoretical advance with good software.