(written in first person because one post author wrote it)
As Nuno notes, I can’t see how else to spend $20M to get more good interp work (naively, I’m not claiming no such ways exist)
I think this is the area we disagree on the most. Examples of other ideas:
1. Generously fund the academics who you do think are doing good work (as far as I can tell, two of them, Christopher Potts and Martin Wattenberg, get no funding from OP, and David Bau gets an order of magnitude less). This is probably more on OP than Redwood, but Redwood could also explore funding academics and working on projects in collaboration with them.
2. Poach experienced researchers who are executing well on interpretability but working on what (by Redwood’s lights) are less important problems, and redirect them to more important problems. Not everyone would want to be “redirected”, but there’s a decent fraction of people who would love to work on more ambitious problems but are currently not incentivized to do so, and a broader range of people are open to working on a wide range of problems so long as they are interesting. I would expect these individuals to cost a comparable amount to what Redwood currently pays (somewhat less if poaching from academia, somewhat more if poaching from industry) but be able to execute more quickly as well as spread valuable expertise around the organization.
3. Make one-year seed grants of around $100k to 20 early-career researchers (PhD students, independent researchers) to work on interpretability, nudging them towards a list of problems viewed important by Redwood. Provide low-touch mentorship (e.g. once a month call). Scale up the grants and/or hire people from the projects that did well after the one-year trial.
I wouldn’t confidently claim that any of these approaches would necessarily best Redwood, but there’s a large space of possibilities that could be explored and largely has not been. Notably, the ideas above differ from Redwood’s high-level strategy to date by: (a) making bets on a broad portfolio of agendas; (b) starting small and evaluating projects before scaling; (c) bringing in external expertise and talent.
I also broadly think that publishing and engaging with the broader ML community is less obviously good for interpretability, as noted I just don’t think most work is very relevant. I think it’s a bet worth making (and am excited about interp in the wild and my grokking work getting into ICLR!), but definitely not obviously worth the effort, eg I think it’s probably the right call that Anthropic doesn’t try to publish their work. Putting pre-prints on Arxiv seems pretty cheap, and I’m pro that, but I think seriously aiming for academic publications is a lot of work (more than 10-20% of a project IMO) and I feel pretty good about Redwood only trying for this when they have employees who are particularly excited about it.
I think I largely agree the percentage of interpretability papers that are relevant to large-scale alignment is disappointingly low. However, the denominator is very large, so I still expect the majority of TAIS-relevant interpretability work to happen outside TAIS organizations. Given this I’d argue there’s considerable value communicating to this subset of the ML research community. Perhaps a peer-reviewed publication is not the best way to do this: I’d be happy to see Redwood staff e.g. giving talks at a select subset of academic labs, but to the best of our knowledge this hasn’t happened.
I agree that getting from the stage of “scrappy preprint / blog post that your close collaborators can understand” to “peer-reviewed publication” can be 10-20% of a project’s time. However, in my experience the clarity of the write-up and rigor of the results often increase considerably in that 10-20%. There are some parts of the publication process that are complete wastes of time (reformatting from single to double column, running an experiment that you already know the results of but that reviewer 2 really wants to see), but in my experience these have been a minority of the work—no more than 5% of the overall project time. I’m curious if you view this as being significantly more costly than I do, or the improvements to the project from peer-review as being less significant.
Sorry for the long + rambly comment! I appreciate the pushback, and found clarifying my thoughts on this useful.
I broadly agree that all of the funding ideas you point to seem decent. My biggest crux is that the counterfactual of not funding Redwood is not that one of those gets funded, and that the real constraints here are around logistical effort, grantmaker time, etc. I wrote a comment downthread with further thoughts on these points.
And that it is not Redwood’s job to solve this—they’re pursuing a theory of change that does not depend on these, and it seems very unreasonable to suggest that they should pursue one of these other uses of money instead, even if you think that the use of money is a great idea.
Re 1, concretely, I’ve been trying to help one of those professors get more funding for his lab, and think this is a high impact use of money. But I think that evaluating professors is hard, thinking through capabilities externalities is hard, figuring out a lab’s room for more funding is hard, and it’s harder to burn a ton of money productively in academia, eg >$1mn (eg, it’s pretty hard to just hire a bunch of engineers, and interp doesn’t really need a ton of compute). There are also dumb network problems where the academics don’t know how to get funding, it’s not very legible how to apply to OpenPhil, not everyone is comfortable taking EA money, etc (I would like these problems to be solved, to be clear). I don’t think it’s a matter of just having more money.
Poach experienced researchers who are executing well on interpretability but working on what (by Redwood’s lights) are less important problems, and redirect them to more important problems. Not everyone would want to be “redirected”, but there’s a decent fraction of people who would love to work on more ambitious problems but are currently not incentivized to do so
I don’t know anyone like this. If you do, please intro me! (I met someone vaguely in this category and helped them to get an FTX grant at the start of November... but they only tangentially fit your description.) I’m pretty unconvinced there are many people like this out there who could be redirected to productively do what I consider good interp work: beyond just motivation and interest in doing independent-ish work, there are also significant considerations of research taste, having mentorship to do work I think is important, etc.
Make one-year seed grants of around $100k to 20 early-career researchers (PhD students, independent researchers) to work on interpretability, nudging them towards a list of problems viewed important by Redwood. Provide low-touch mentorship (e.g. once a month call). Scale up the grants and/or hire people from the projects that did well after the one-year trial.
Seems good, I’d be excited about this happening. I consider my MATS scholars to be vaguely in the spirit of this, and I’ve been very impressed with them. But, like, this is so not bottlenecked on money. It’s a substantial program that would take effort to run, it’s not clear to me that these people would do good work without mentorship (1/month might be sufficient), it’s not clear that this adds much value beyond existing independent researcher grants, etc. But I do think it’s a decent idea—if anyone is interested in making this happen, please reach out!
However, the denominator is very large, so I still expect the majority of TAIS-relevant interpretability work to happen outside TAIS organizations
There’s some work I think is cool, but it tends to be concentrated in a small handful of actually good labs (eg I like ROME and Emergent World Representations a lot). There’s a bunch of work I think isn’t great, but sometimes has great gems in it. But honestly I think that well over a majority of impact-weighted TAIS work was done by the TAIS community (specifically, Chris Olah and collaborators’ work is quite possibly a majority in my mind). I’d be interested in being pointed to work that you think is great that I’m missing; I personally find literature reviews to be pretty tedious, and think I underinvest in this kind of thing.
More broadly, my position is that engaging with academia is a theory of change, but one of many. It’s a significant investment of time, some people are much better at it than others (eg, I personally just hate writing papers, and am much worse at it than just directly trying to do good research, or write blog posts/educational materials/good tooling), it’s hard to direct in targeted ways, benefits a bunch from legible signalling and credentials, etc. I also think Redwood are more pessimistic on it than I am, and eg I am personally not convinced that trying to get grokking into ICLR was a good use of time and effort (though I hope it was!). I think Redwood are making a pretty reasonable bet here.
As a negative example here, I think Distill was a major investment of effort into influencing academia, including on doing better interp work, and it basically failed as far as I can tell (despite, to my eyes, Distill papers being notably higher quality and more interesting than conference papers).
I’m curious if you view this as being significantly more costly than I do, or the improvements to the project from peer-review as being less significant.
I want to distinguish two things—putting in the effort to make a write-up really good, and putting in the effort to eg get it accepted at ICLR/ICML/NeurIPS. I am pretty pro making write-ups really good (I personally am not very good at it and try to avoid it where possible, but this is a personal taste not a value judgement). Eg I really like Anthropic interp papers (though am biased) and think the effort put into presentation and clarity is pretty well spent. And I think that part of submitting to a top conference is making things tightly and clearly phrased, having good figures, making them well presented, having good evidence for your results.
IMO the biggest cost is shaping the results and narrative of your work to fit the kind of thing that reviewers look for, and think is good. I broadly think this just isn’t that correlated with what good interp work looks like. I think this can be extremely expensive if you let it shape your research process, choice of projects, etc for “this would make a good publication”. In cases like grokking, I did the research I wanted to do, and we then decided to go for a publication, which I think was basically fine, and much less costly. But it did involve significant reshaping and optimisation of the narrative (I am personally sad that progress measures got into the title lol).
Idk, these are complex questions, and there are people I respect who are way more or less pro academia + publishing than me. I am personally pretty biased against academia and publishing, and this affects my value judgements here.