This is a very important subject. I’ve also come to a similar conclusion independently recently—that a lot of the AI Safety work being funded by EA orgs (OpenPhil in particular sticks out) is prosaic AI alignment work that is not very neglected. Perversely, as the number of academic and commercial researchers and practitioners working on prosaic alignment has exploded, there has been a shift in the AI Safety community towards prosaic AI work and away from agent foundations type work. From what I hear, that shift has mainly been driven by concerns that scaling transformer-like models will lead to existentially dangerous AI, which I personally think is misguided for reasons that would take too much space to go into here. [I personally think scaling up transformers / deep learning will be part of the AGI puzzle, but there are missing pieces, and I would rather people try to figure out what those missing pieces are than focus so much on safety and alignment with deep learning, which is not neglected at this point.]
I think this post missed how much prosaic AI safety work is being done in commercial settings, since it only focused on academia. I am currently working to deploy models from both commercial companies and academic research labs into hospital radiology clinics so we can gather real-world data on performance and get feedback from radiologists. Lack of robustness to distribution shift is a huge problem, and we are having to develop additional AI models to identify outliers and anomalies and do automated quality control on images to make sure the right body part is shown and it’s in the right orientation for whatever AI model we are trying to run. I would argue that every major company trying to deploy ML/DL is struggling with similar robustness issues, and there are enormous commercial/market incentives to solve such issues. Furthermore, there is intense pressure on companies to make their models explainable (see the EU’s GDPR legislation, which gives users a “right to an explanation” in certain cases). We also see this with transformers—after various debacles at Google (like the Black people = gorillas case in Google Photos, anti-semitic autocompletes, etc.) and Microsoft (Tay), I’m pretty confident both companies will be putting a lot of effort into aligning transformers (the current default for transformer models is to spew racist crap and misinformation at random intervals, which is a complete no-go from a commercial perspective).
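To make the QC point concrete, here is a minimal sketch (not our actual pipeline) of the kind of gate I mean: a small classifier that checks the body part and its confidence before the downstream diagnostic model is allowed to run. The model architecture, label set, and threshold below are hypothetical placeholders; the confidence threshold is just the standard max-softmax idea.

```python
# A minimal, illustrative QC gate: a stand-in classifier checks that an image
# looks like the body part the diagnostic model expects before it is run.
# Everything here (model, labels, threshold) is a placeholder, not our real system.
import torch
import torch.nn as nn
import torch.nn.functional as F

BODY_PARTS = ["chest_pa", "chest_lateral", "abdomen", "other"]  # hypothetical labels

class QCClassifier(nn.Module):
    """Stand-in CNN; in practice this would be trained on labeled radiology images."""
    def __init__(self, n_classes: int = len(BODY_PARTS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def qc_gate(model: nn.Module, image: torch.Tensor,
            expected: str, min_confidence: float = 0.9) -> bool:
    """Pass the study along only if the image looks like the expected body part,
    with predicted confidence above a tuned threshold (max-softmax baseline)."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(image.unsqueeze(0)), dim=-1).squeeze(0)
    conf, idx = probs.max(dim=-1)
    return BODY_PARTS[int(idx)] == expected and conf.item() >= min_confidence

if __name__ == "__main__":
    model = QCClassifier()               # untrained here, purely for illustration
    fake_xray = torch.randn(1, 512, 512)  # (channels, H, W) grayscale image
    if not qc_gate(model, fake_xray, expected="chest_pa"):
        print("QC gate failed: route the study to a human instead of the AI model.")
```

In practice the interesting (and hard) part is tuning that threshold and deciding what to do with the flagged studies, which is exactly the kind of unglamorous robustness work industry is already pouring effort into.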
So, a lot (but not all!) of the work going on at Anthropic and Redwood, to be perfectly frank, seems very un-neglected to me, although there is an argument that academics will have trouble doing such work because of problems accessing compute and data. [To be clear, I still think it’s worth having EAs doing alignment with the AI methods/models at the frontier of various benchmarks, with an eye to where things are going in the future; what I take issue with is the relative amount of EA funding flowing into such work right now.]
The amount of research going on in both industry and academia under the umbrella of “fairness, accountability, and transparency” has exploded recently (see the ACM FAccT conference). Research on explainability methods has exploded too (a very non-exhaustive list I compiled in early 2020 includes some 30 explainability methods, and I’ve seen graphs showing that the number of papers on explainability/interpretability/XAI is growing exponentially).
There is a very important point I want to make about explainability work in particular, though—most of it is “post-hoc” explainability where the main goal is increasing trust (which often, to be frank, just involves showing pretty pictures and telling stories, without empirically measuring whether trust is increased and whether the increased trust is warranted). Sometimes researchers also want to debug failures and detect bias, although I have not found many non-trivial cases where this has been achieved. I believe 90%+ of current explainability work has very little value for actually understanding how AI models work mechanistically and thus has no relevance to AI existential safety. I keep pointing people to the work of Adebayo et al. showing that the most popular XAI methods today are useless for actual mechanistic understanding, and the work of Hase et al. showing popular XAI methods don’t help humans much in predicting how an ML model will behave. So, work on tools and methods for generating and validating mechanistic explanations in particular still seems very neglected. The only real work that goes to the “gears level” (i.e. the level of neurons/features and weights/dependencies) is the “Circuits” work by Olah et al. at OpenAI.

The issue with mechanistic explainability work, though, is the high amount of work/effort required and the risk that methods won’t transfer to future architectures. For instance, if AI moves from transformers to something like Hebbian networks, energy-based models, spiking neural nets, or architectures designed by genetic algorithms (all plausible scenarios in my view), then entirely new tool chains will have to be built, and plausibly some methods will have to be thrown out completely and entirely new ones developed. Of course, this is all very unclear… I think more clarity will be obtained once we see how well the tools and methods developed by Olah et al. for CNNs transfer to transformers, which I hear is something they are doing at Anthropic.
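For anyone who hasn’t read the Adebayo et al. paper, the flavor of their sanity checks is easy to sketch: compute a saliency map from the trained model, re-randomize the weights, compute it again, and see whether the explanation even notices. The snippet below is a rough illustration of that idea using a toy model and plain input-gradient saliency; it is not a reproduction of their experiments, and the model and data are random stand-ins.

```python
# Rough sketch of a "model parameter randomization" sanity check in the spirit
# of Adebayo et al.: if a saliency map barely changes when the trained weights
# are replaced with random ones, the map cannot reflect what the model learned.
# The model and input here are toy stand-ins, purely for illustration.
import copy
import torch
import torch.nn as nn

def gradient_saliency(model: nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
    """Plain input-gradient saliency for one image (shape: C x H x W)."""
    x = x.clone().unsqueeze(0).requires_grad_(True)
    model.zero_grad()
    model(x)[0, target].backward()
    return x.grad.abs().squeeze(0)

def randomization_check(model: nn.Module, x: torch.Tensor, target: int) -> float:
    """Rank correlation between saliency from the trained model and from a copy
    with re-initialized weights. Values near 1 are a red flag for the method."""
    randomized = copy.deepcopy(model)
    for m in randomized.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()
    s1 = gradient_saliency(model, x, target).flatten()
    s2 = gradient_saliency(randomized, x, target).flatten()
    # Spearman correlation computed as Pearson correlation of the ranks.
    r1, r2 = s1.argsort().argsort().float(), s2.argsort().argsort().float()
    return torch.corrcoef(torch.stack([r1, r2]))[0, 1].item()

if __name__ == "__main__":
    toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64),
                              nn.ReLU(), nn.Linear(64, 10))
    image = torch.rand(3, 32, 32)
    print("saliency similarity after weight randomization:",
          randomization_check(toy_model, image, target=0))
```

The point of showing this is how cheap the check is relative to how rarely it (or anything like it) is applied before an explanation method gets used to justify trusting a model.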
Thanks Dan. I agree that industry is more significant (even though the part of industry that publishes is ~10x smaller than academic AI research). If you have some insight into the size and quality of the non-publishing part, that would be useful.
Do language models default to racism? As I understand the Tay case, it took thousands of adversarial messages to make it racist.
Agree that the conflation of trust and trustworthiness is very toxic. I tend to ignore XAI.