Neel Nanda comments on Critiques of prominent AI safety labs: Redwood Research

Neel Nanda Apr 2, 2023, 8:16 PM
5 points
Sorry for the long + rambly comment! I appreciate the pushback, and found clarifying my thoughts on this useful

I broadly agree that all of the funding ideas you point to seem decent. My biggest crux is that the counterfactual of not funding Redwood is not that one of those gets funded, and that the real constraints here are around logistical effort, grantmaker time, etc. I wrote a comment downthread with further thoughts on these points.

I also think that it is not Redwood’s job to solve this: they’re pursuing a theory of change that does not depend on these, and it seems very unreasonable to suggest that they should pursue one of these other uses of money instead, even if you think that use of money is a great idea.

Re 1, concretely, I’ve been trying to help one of those professors get more funding for his lab, and think this is a high impact use of money. But I think that evaluating professors is hard, thinking through capabilities externalities is hard, figuring out a lab’s room for more funding is hard, and it’s harder to burn a ton of money productively in academia, eg >$1mn (eg, it’s pretty hard to just hire a bunch of engineers, and interp doesn’t really need a ton of compute). There are also dumb network problems where the academics don’t know how to get funding, it’s not very legible how to apply to OpenPhil, not everyone is comfortable taking EA money, etc (I would like these problems to be solved, to be clear). I don’t think it’s a matter of just having more money.

Poach experienced researchers who are executing well on interpretability but working on what (by Redwood’s lights) are less important problems, and redirect them to more important problems. Not everyone would want to be “redirected”, but there’s a decent fraction of people who would love to work on more ambitious problems but are currently not incentivized to do so

I don’t know anyone like this. If you do, please intro me! (I met someone vaguely in this category and helped them to get an FTX grant at the start of November... but they only tangentially fit your description.) I’m pretty unconvinced there are many people like this out there who could be redirected to productively do what I consider good interp work. Beyond just motivation and interest in doing independent-ish work, there are also significant considerations of research taste, having mentorship to do work I think is important, etc.

Make one-year seed grants of around $100k to 20 early-career researchers (PhD students, independent researchers) to work on interpretability, nudging them towards a list of problems viewed important by Redwood. Provide low-touch mentorship (e.g. once a month call). Scale up the grants and/or hire people from the projects that did well after the one-year trial.

Seems good, I’d be excited about this happening. I consider my MATS scholars to be vaguely in the spirit of this, and I’ve been very impressed with them. But, like, this is so not bottlenecked on money. It’s a substantial program that would take effort to run, it’s not clear to me that these people would do good work without mentorship (1/month might be sufficient), it’s not clear that this adds much value beyond existing independent researcher grants, etc. But I do think it’s a decent idea—if anyone is interested in making this happen, please reach out!

However, the denominator is very large, so I still expect the majority of TAIS-relevant interpretability work to happen outside TAIS organizations

There’s some work I think is cool, but it tends to be concentrated in a small handful of actually good labs (eg I like ROME and Emergent World Representations a lot). There’s a bunch of work I think isn’t great, but sometimes has great gems in it. But honestly I think that well over a majority of impact-weighted TAIS work was done by the TAIS community (specifically, Chris Olah + collaborators’ work is quite possibly a majority in my mind). I’d be interested in being pointed to work that you think is great that I’m missing. I personally find literature reviews to be pretty tedious, and think I underinvest in this kind of thing.

More broadly, my position is that engaging with academia is a theory of change, but one of many. It’s a significant investment of time, some people are much better at it than others (eg, I personally just hate writing papers, and am much worse at it than just directly trying to do good research, or write blog posts/educational materials/good tooling), it’s hard to direct in targeted ways, benefits a bunch from legible signalling and credentials, etc. I also think Redwood are more pessimistic on it than I am, and eg I am personally not convinced that trying to get grokking into ICLR was a good use of time and effort (though I hope it was!). I think Redwood are making a pretty reasonable bet here.

As a negative example here, I think Distill was a major investment of effort into influencing academia, including on doing better interp work, and it basically failed as far as I can tell (despite, to my eyes, Distill papers being notably higher quality and more interesting than conference papers).

I’m curious if you view this as being significantly more costly than I do, or the improvements to the project from peer-review as being less significant.

I want to distinguish two things—putting in the effort to make a write-up really good, and putting in the effort to eg get it accepted at ICLR/ICML/NeurIPS. I am pretty pro making write-ups really good (I personally am not very good at it and try to avoid it where possible, but this is a personal taste not a value judgement). Eg I really like Anthropic interp papers (though am biased) and think the effort put into presentation and clarity is pretty well spent. And I think that part of submitting to a top conference is making things tightly and clearly phrased, having good figures, making them well presented, having good evidence for your results.

IMO the biggest cost is shaping the results and narrative of your work to fit the kind of thing that reviewers look for, and think is good. I broadly think this just isn’t that correlated with what good interp work looks like. I think this can be extremely expensive if you let “this would make a good publication” shape your research process, choice of projects, etc. In cases like grokking, I did the research I wanted to do, and we then decided to go for a publication, which I think was basically fine, and much less costly. But it did involve significant reshaping and optimisation of the narrative (I am personally sad that progress measures got into the title lol).

Idk, these are complex questions, and there are people I respect who are way more or less pro academia + publishing than me. I am personally pretty biased against academia and publishing, and this affects my value judgements here.