Redwood Research and Constellation
Nate Thomas
Thanks for the comment Dan. I agree that the adversarially mined examples literature is the right reference class, of which the two that you mention (Meta’s Dynabench and ANLI) were the main examples (maybe the only examples? I forget) while we were working on this project.
I’ll note that Meta’s Dynabench sentiment model (the only model of theirs that I interacted with) seemed substantially less robust than Redwood’s classifier (e.g. I was able to defeat it manually in about 10 minutes of messing around, whereas I needed the tools we made to defeat the Redwood model).
Thanks to the authors for taking the time to think about how to improve our organization and the field of AI takeover prevention as a whole. I share a lot of the concerns mentioned in this post, and I’ve been spending a lot of my attention trying to improve some of them (though I also have important disagreements with parts of the post).
Here’s some information that perhaps supports some of the points made in the post and adds texture, since it seems hard to properly critique a small organization without a lot of context and inside information. (This is adapted from my notes over the past few months.)

Most importantly, I am eager to increase our rate of research output – and, critically, to have that increase be sustainable because it’s done by a more stable and well-functioning team. I don’t think we should be satisfied with the current output rate, and I think this rate being too low is in substantial part due to not having had the right organizational shape or sufficiently solid management practices (which, in empathy with the past selves of the Redwood leadership team, is often a tricky thing for young organizations to figure out, and is perhaps especially tricky in this field).
I think the most important error that we’ve made so far is trying to scale up too quickly. I feel bad about the ways in which this has contributed to people who’ve worked here having an unexpectedly bad experience. I believe this was upstream of other organizational mistakes and that it put stress on our relative inexperience in management. While having fewer staff gives fewer people a chance to have roles working on our type of AI alignment research, I expect it will help increase the management quality per person. For example, I think there will be more and better opportunities for researchers at Redwood to grow, which is something I’ve been excited to focus on. I think scaling too quickly was somewhat downstream of not having an extremely clear articulation of what specific flavor of research output we are aiming to produce and, in turn, not having a tested organization that we believe reliably produces those outputs.
I think this was an unforced error on our part – for example, Holden and Ajeya expressed concerns to me about this multiple times. My thinking at the time was something like “this sure seems like a pretty confusing field in a lot of ways, and (something something act-omission bias) I’m worried that if we choose an unrealistically high standard for clarity to gate organizational growth on, then we might learn more slowly than we otherwise would, and fail to give people opportunities to contribute to the field.” I now think that I was wrong about this.
With that said, I’ll also briefly note some of the ways I disagree with the content and framing of this post:

We think our “causal scrubbing” work is our most significant output so far – substantially more important than, for example, our “Interpretability in the Wild” work.
At the beginning of our adversarial training project, we reviewed the literature (including the papers in the list that the above post links to) and discussed the project proposal with relevant experts. I think we made important mistakes in that project, but I don’t think that we failed to understand the state of the field.
I am moderately optimistic about Redwood’s current trajectory and our potential to contribute to making the future go well. I feel substantially better about the place that we’re in now relative to where we were, say, 6 months ago. We remain a relatively young organization making an unusual bet.
I really appreciate feedback, and if anyone reading this wants to send feedback to us about Redwood, you can email info at rdwrs.com or, if you prefer anonymity, visit www.admonymous.co/redwood.
(I’ll also use this comment to discuss some aspects of other questions that have been asked.)
I think there are currently something like three categories of bottlenecks on alignment research:
1. Having many tractable projects to work on that we expect will help (this may be limited by theoretical understanding / lack of end-to-end alignment solution)
2. Institutional structures that make it easy to coordinate to work on alignment
3. People who will attack the problem if they’re given some good institutional framework
Regarding 1 (“tractable projects / theoretical understanding”): Maybe in the next few years we will come to have clearer and more concrete schemes for aligning superhuman AI, and this might make it easier to scope engineering-requiring research projects that implement or test parts of those plans. ARC, Paul Christiano’s research organization, is one group that is working towards this.
Regarding 2 (“institutional structures”), I think of there being 5 major categories of institutions that could house AI alignment researchers:
Alignment-focused research organizations (such as ARC or Redwood Research)
Industry labs (such as OpenAI or DeepMind)
Academia
Independent work
Government agencies (none exist currently that I’m aware of, but maybe they will in the future)
Redwood Research is currently focused on 2. One of the hypotheses behind Redwood’s current organizational structure is “it’s important for organizations to focus closely on alignment research if they want to produce a lot of high-quality alignment research” (see, for example, common startup advice such as “The most important thing for startups to do is to focus” (Paul Graham)). My guess is that it’s generally tricky to stay focused on the problems that are most likely to be core alignment problems, and I’m not sure how to do it well in some institutions. I’m excited about the prospect of alignment-focused research organizations that are carefully focused on x-risk-reducing alignment work and willing to deploy resources and increase headcount toward this work.
At Redwood, our current plan is to
solicit project ideas that are theoretically motivated (i.e., they have some compelling story for how they are either analogous to, or directly solve, x-risk-associated problems for alignment of superintelligent systems) from researchers across the field of x-risk-motivated AI alignment,
hire researchers and engineers who we expect to help execute on those projects, and
provide the managerial and operational support for them to successfully complete those projects.
There are various reasons why a focus on focus might not be the right call, such as “it’s important to have close contact with top ML researchers, even if they don’t care about working on alignment right now, otherwise you’ll be much worse at doing ML research” or “it’s important to use the latest technology, which could require developing that technology in-house”. This is why I think industry labs may be a reasonable bet. My guess is that (with respect to quality-adjusted output of alignment research) they have lower variance but also lower upside. Roughly speaking, I am currently somewhat less excited about academia, independent work, and government agencies, but I’m fairly uncertain, and there are definitely people and types of work that might do much better in those homes.
To wildly speculate, I could imagine a good and achievable distribution across institutions being 500 in alignment-focused research organizations (which might be much more willing and able to productively absorb people for alignment research), 300 in industry labs, 100 in academia, 50 independent researchers, and 50 in government agencies (though plausibly these numbers should be very different in particular circumstances). Of course “number of people working in the field” is far from an ideal proxy for total productivity, so I’ve tried to adjust for targetedness and quality of output in my discussion here.
I estimate the current size of the field of x-risk-reduction-motivated AI alignment research at roughly 100 people (very roughly speaking, rounded to an order of magnitude), so 1000 people would constitute something like a 10x increase. (My guess for the current distribution is 30 in alignment orgs, 30 in industry labs, 30 in academia, 10 independent researchers, and 0 in government (very rough numbers, rounded to the nearest half order of magnitude).) I’d guess there are at this time something like 30–100 people who, though they are not currently working on x-risk-motivated AI alignment research, would start working on this if the right institutions existed. I would like this number (of potential people) to grow a lot in the future.
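To make the arithmetic behind these speculative figures explicit, here’s a minimal sketch (the numbers are just the rough guesses above, not data):

```python
# Speculative "good and achievable" distribution of x-risk-motivated alignment
# researchers by institution type (hypothetical figures from the text above)
target = {
    "alignment-focused orgs": 500,
    "industry labs": 300,
    "academia": 100,
    "independent": 50,
    "government": 50,
}

# Very rough current estimates, rounded to the nearest half order of magnitude
current = {
    "alignment-focused orgs": 30,
    "industry labs": 30,
    "academia": 30,
    "independent": 10,
    "government": 0,
}

print(sum(target.values()))                          # 1000
print(sum(current.values()))                         # 100
print(sum(target.values()) / sum(current.values()))  # ~10x increase
```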
Regarding 3 (“people”), the spread of the idea that it would be good to reduce x-risks from TAI (and maybe general growth of the EA movement) could increase the size and quality of the pool of people who would develop and execute on alignment projects. I am excited for the work that Open Philanthropy and university student groups such as Stanford EA are doing towards this end.
I’m currently unsure what an appropriate fraction of the technical staff of alignment-focused research organizations should be people who understand and care a lot about x-risk-motivated alignment research. I could imagine that ratio being something like 10%, or like 90%, or in between.
I think there’s a case to be made that alignment research is bottlenecked by current ML capabilities, but I (unconfidently) don’t think that this is currently a bottleneck; I think there is a bunch more alignment research that could be done now with current capabilities (e.g., my guess is that less than 50% of the alignment work that could be done at current levels of capabilities has been done; I could imagine there being something like 10 or more projects that are as helpful as “Deep RL from human preferences” or “Learning to summarize from human feedback”).
Re your point about “building an institution” and step 3: We think the majority of our expected value comes from futures in which we produce more research value per dollar than in the past.
(Also, just wanted to note again that $20M isn’t the right number to use here, since around 1/3rd of that funding is for running Constellation, as mentioned in the post.)