I have so far gotten the same impression: making RLHF work by scaling it iteratively, gradually, and in a very operationally secure way seems like maybe the most promising approach. My view remains the one you’ve expressed: as much as RLHF++ has going for it in a relative sense, it leaves a lot to be desired in an absolute sense, in light of the alignment/control problem for AGI.
Overall, I really appreciate how this post condenses, in detail, what is increasingly common knowledge: just how inadequate the sum total of the major approaches being taken to alignment is. I’ve read analyses reaching the same conclusion from several other AGI safety/alignment researchers during the last year or two. Yet where I hit a wall is my strong sense that any alternative approach could just as easily succumb to most, if not all, of the same major pitfalls you list the RLHF++ approach as having to contend with. In that sense, I also feel most of your points are redundant. To get specific about how your criticisms of RLHF apply to all the other alignment approaches as well…
This currently feels way too much like “improvise as we go along and cross our fingers” to be Plan A; this should be Plan B or Plan E.
Whether it’s an approach inspired by the paradigms of Christiano, Yudkowsky, interpretability research, or something else, I’ve gotten the sense that essentially all alignment researchers honestly feel the same way about whatever approach to alignment they’re taking.
“It might well not work. I expect this to harvest a bunch of low-hanging fruit [...] This really shouldn’t be our only plan.”
I understand how it can feel like, based on how some people talk about RLHF, and sometimes interpretability, they’re implying that we’ll be fine with just this one approach. At the same time, as far as I’m aware, once you get behind any hype, almost everyone admits that whatever particular approach to alignment they’re taking may fail to generalize and shouldn’t be the only plan.
It rests on pretty unclear empirical assumptions on how crunchtime will go.
I’ve gotten the sense that the empirical assumption about how crunchtime will go, among researchers taking the RLHF approach, is, for lack of a better term, a kind of medium-term forecast for the tipping point for AGI, i.e., probably at least between 2030 and 2040, as opposed to between 2025 and 2030.
Given a certain chain of logical assumptions about the trajectory or acceleration of capabilities research, there is of course an intuitive case to be made, on rational-theoretic grounds, for acting in practice as though forecasts of short(er) AGI timelines, e.g., between 1 and 5 years out, are simply correct.
At the same time, any such model of the timeline and/or trajectory towards AGI could just as easily be totally wrong. The research teams most dedicated to really solving the control problem for transformative/general AI, and with the shortest timelines, are also acting under assumptions derived from models that severely lack any empirical basis.
As far as I’m aware, several for-profit startups and non-profit research organizations have been trialing state-of-the-art approaches to prediction markets and forecasting methodologies, especially for timelines and trajectories of capabilities research for transformative/general AI.
During the last few years, they’ve together received at least a few million dollars to run experiments on how to achieve more empirically based models of AI timelines or trajectories. While there may be valuable insights for empirical forecasting methods overall, I’m not aware of any results at all vindicating, literally, any theoretical model for capabilities forecasting.
I’m not sure this plan puts us on track to get to a place where we can be confident that scalable alignment is solved. By default, I’d guess we’d end up in a fairly ambiguous situation.
This is yet another criticism of the RLHF approach that, as I understand it, applies just as easily to every approach to alignment you’ve mentioned, and even to every remotely significant approach you didn’t mention but I’ve encountered.
You also mentioned that, for both the relatively cohesive (set of) approach(es) inspired by Christiano’s research and for more idiosyncratic approaches, a la MIRI, you perceive the very abstract, almost purely mathematical approach being taken to be a dead end. That’s an understandable and sympathetic take. All things being equal, I’d agree with your proposal for what should be done instead:
We need a concerted effort that matches the gravity of the challenge. The best ML researchers in the world should be working on this! There should be billion-dollar, large-scale efforts with the scale and ambition of Operation Warp Speed or the moon landing or even OpenAI’s GPT-4 team itself working on this problem.
Unfortunately, all things are not equal. The societies we live in will, for the foreseeable future, keep operating on a set of very unfortunate incentive structures.
Investing so strongly in ML-based approaches to alignment research often entails, in practice, working in some capacity that advances capabilities research even more, especially in industry and the private sector. That’s a major reason why, regardless of whatever ways they might be superior, ML-based approaches to alignment are often eschewed.
I.e., most conscientious alignment researchers don’t feel their field is ready to pivot so fully to ML-based approaches to alignment without, in the process, increasing whatever existential risk superhuman AGI might pose to humanity rather than decreasing it. As harsh as I’m maybe being, I also think the most novel and valuable propositions in this post are ones of your own that you’ve downplayed:
For example, I’m really excited about work like this recent paper (paper, blog post on broader vision), which prototypes a method to detect “whether a model is being honest” via unsupervised methods. More than just this specific result, I’m excited about the style:
Use conceptual thinking to identify methods that might plausibly scale to superhuman methods (here: unsupervised methods, which don’t rely on human supervision)
Empirically test this with current models.
I think there’s a lot more to do in this vein—carefully thinking about empirical setups that are analogous to the core difficulties of scalable alignment, and then empirically testing and iterating on relevant ML methods.
My one recommendation is that you don’t dwell any longer on the many things about AI alignment as a field that most alignment researchers already acknowledge, and instead get down your proposals for taking an evidence-based approach to expanding the robustness of alignment of unsupervised systems. That’s as exciting a new research direction as any I’ve heard of in the last year!