An overview of some promising work by junior alignment researchers

We’re all familiar with ELK, natural abstractions, and toy models of superposition.

But there’s also a new cohort of alignment researchers. Many of them got involved (or produced their first major pieces of work) in the last year.

I’ve been impressed by the quality of some of their work, and I think it deserves wider recognition. The recent increase in attention being paid to alignment gives me some hope that an “unknown genius” may emerge in the field. Additionally, there are several junior alignment researchers who seem to have a lot of potential, and I’m excited to see their contributions as they get more experience and influence in the field.

Here’s some work by junior alignment researchers that excited me in the last year:

Externalized reasoning oversight by Tamera Lanham

Mechanistic interpretability tries to understand what models “think” by looking at their weights and activations. Externalized reasoning oversight tries to do this by just asking the models to explain their reasoning.

Inspired by chain-of-thought prompting techniques, Tamera set out to see if she could get language models to provide honest and transparent answers about their reasoning processes. In the limit, this technique could provide a novel way for us to interpret language models.

I’m excited to see junior researchers tackle existing problems (e.g., how do we understand what models think?) in new ways (e.g., hm, everyone’s been focusing on weights and activations… are there any alternatives?). Tamera’s work is a great example.

Relevant post: Externalized reasoning oversight: A research direction for language model alignment

Relevant interview: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Goal misgeneralization by Lauro Langosco

Everyone knows that models can learn unintended goals. Everyone knows that models can fail to generalize out-of-distribution.

Lauro’s paper connects these points and distinguishes between capabilities generalization failures (the model is incompetent out-of-distribution) and goal misgeneralization failures (the model is competent but pursues the wrong goal). He and his colleagues then define goal misgeneralization more concretely in an RL context, discuss its implications for alignment, and demonstrate alignment failures in existing models. Moreover, I found the writing in his paper to be particularly good at striking a balance between (a) explicitly discussing x-risks and (b) presenting arguments rigorously in a format suitable for an ML audience.
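To make the distinction concrete, here is a minimal toy sketch. It is not taken from the paper: the corridor environment, the two hand-written policies, and the `classify` rule are all hypothetical and chosen only for illustration. The agent starts in the middle of a short corridor, and during "training" the coin is always at the right end, so "walk right" and "get the coin" are indistinguishable until the coin moves.

```python
# Toy illustration only: a hypothetical 1-D corridor, not an environment from the paper.
LENGTH = 7   # corridor cells are 0..6
START = 3    # agent starts in the middle


def run_episode(policy, max_steps=10):
    """Roll out a policy and return the agent's final position."""
    pos = START
    for _ in range(max_steps):
        pos = max(0, min(LENGTH - 1, pos + policy(pos)))
        if pos in (0, LENGTH - 1):  # episode ends when a wall is reached
            break
    return pos


def always_right(pos):
    """Stand-in for a policy that learned 'go right' instead of 'get the coin'."""
    return +1


def dithering(pos):
    """Stand-in for an incompetent policy that oscillates and gets nowhere."""
    return +1 if pos % 2 else -1


def classify(policy, coin_pos):
    """Crude labeling rule (an assumption of this sketch, not a formal definition)."""
    final = run_episode(policy)
    if final == coin_pos:
        return "success"
    if final in (0, LENGTH - 1):
        # The agent navigated competently to an end of the corridor, just the wrong one.
        return "goal misgeneralization"
    return "capability failure"


print(classify(always_right, coin_pos=LENGTH - 1))  # coin where it was in training
print(classify(always_right, coin_pos=0))           # coin moved at test time
print(classify(dithering, coin_pos=0))              # agent that cannot navigate at all
```

In this sketch the "go right" policy looks aligned whenever the coin is on the right and only reveals its proxy goal once the coin moves, while the dithering policy fails for the more mundane reason that it cannot navigate at all.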

I’m excited to see more work that takes well-known problems in alignment and tries to (a) explain them more concretely and (b) examine them in current-day systems. I don’t expect this work to directly solve alignment, but I expect it to help us get a better understanding of alignment problems, find new ways to make progress on these problems, and make it easier for other ML researchers to find alignment problems they can work on.

Relevant paper: Goal misgeneralization in deep reinforcement learning

An overview of the technical alignment landscape by Thomas Larsen & Eli Lifland

Many research fields regularly produce literature reviews and meta-analyses. In fact, these are often the most widely read and widely cited papers.

Alignment hasn’t had much of this. Perhaps this is because the field is small, so “everyone already knows what everyone is working on.” This certainly isn’t true for junior researchers, and my conversations with senior researchers suggest this isn’t even the case for the veterans. (People are generally focused on their research, and they don’t have time to follow everything in the space.)

Earlier this year, Thomas Larsen & Eli Lifland presented the most comprehensive overview of the technical alignment space that I’m aware of. In addition to summarizing various alignment agendas, they offered brief opinions and assessments of them.

I’m excited to see more people who are willing to try to “understand the entire space” and who are bold enough to raise criticisms of work by senior people. This is how research fields grow, and I’m grateful to Larsen & Lifland for providing an excellent example.

Relevant post: (My understanding of) What Everyone in Technical Alignment is Doing and Why

Naive hypotheses by Shoshannah Tekofsky

Have you ever noticed that when you learn more about a field, you start to think more like everyone else? Do you notice concepts and frames like “inner/​outer alignment” or “latent knowledge” or “sharp left turn” popping up in your thoughts?

Shoshannah Tekofsky anticipated this. So, before she read what other alignment researchers thought, she decided to write down her “naive hypotheses” for how to solve the alignment problem. None of her ideas are likely to work (obviously), but I’m impressed by the epistemology she’s bringing to the table. Shoshannah is now reading up on the alignment problem and the various approaches to solving it. But if she notices her creativity dipping, she can return to her list of naive hypotheses as an intuition pump.

I’m excited to see more people do exercises like this, where they write down their best guesses for how to solve alignment. More broadly, I’m excited for junior researchers to engage in more active learning, where they’re more willing to pause and write down their own thoughts, ideas, and objections about the concepts they’re learning. Shoshannah embodies this.

Relevant post: Naive hypotheses on AI alignment

Relevant interview: Shoshannah Tekofsky on skilling up in AI safety, visiting Berkeley, and developing novel research ideas

A few other spotlights

  • I’m looking forward to seeing more from Vivek Hebbar.

  • I considered putting Shard Theory by Quentin Pope & Alex Turner on the list, but I don’t think the “junior” label feels quite right for them.

  • Steven Byrnes released the Intro to Brain-Like AGI Safety sequence this year; I haven’t read it yet, but a few people I know have found it interesting/​useful.

Caveats

  • This post is biased toward work by people I know (I’m friends with several of them).

  • This post is biased toward work that has been published or posted. On one hand, I think it’s healthy for a research field to have written and sharable outputs, and I generally think the alignment field would be better if people published their work and thinking more often. On the other hand, this list underrepresents work that is (a) in progress or (b) not sharable.

  • This post underrepresents work done by junior researchers who are part of large teams or organizations (e.g., junior alignment researchers who contributed meaningfully to Redwood’s causal scrubbing or indirect object identification papers).

  • This list is, of course, not exhaustive. I think it’s better to post an incomplete list than no list at all. Feel free to spotlight additional work in the comments.

Crossposted from LessWrong (34 points, 0 comments)