Some mathy AI safety pieces and other related material, off the top of my head (in no particular order, and definitely neither comprehensive nor weighted by impact or influence):
The Speed + Simplicity Prior is probably anti-deceptive
Prediction can be Outer Aligned at Optimum
Reinforcement Learning in Newcomblike Environments
Commitment games with conditional information revelation
Chris Olah’s older pieces on neural networks (under ‘Neural Networks (General)’ and below)