A Rocket–Interpretability Analogy

1.

4.4% of the US federal budget went into the space race at its peak.

This was surprising to me, until a friend pointed out that landing rockets on specific parts of the moon requires very similar technology to landing rockets in Soviet cities.[1]

I wonder how much more enthusiastic the scientists working on Apollo were, with the convenient motivating story of “I’m working towards a great scientific endeavor” vs “I’m working to make sure we can kill millions if we want to”.

2.

The field of alignment seems to be increasingly dominated by interpretability (and obedience[2]).

This was surprising to me[3], until a friend pointed out that partially opening the black box of NNs is the kind of technology that would let scaling labs find new unhobblings, both by noticing ways in which the internals of their models are inefficient and by giving them better tools to evaluate capabilities advances.[4]

I wonder how much more enthusiastic the alignment researchers working on interpretability and obedience are, with the motivating story “I’m working on pure alignment research to save the world” vs “I’m building tools and knowledge which scaling labs will repurpose to build better products, shortening timelines to existentially threatening systems”.[5]

3.

You can’t rely on the organizational systems around you to be pointed in the right direction, and there are obvious reasons for commercial incentives to want to channel your idealistic energy towards types of safety work which are dual-use or even primarily capabilities-enabling. For similar reasons, many of the training programs prepare people for the kinds of jobs which come with large salaries and prestige, as a flawed proxy for people moving the needle on x-risk.

If you’re genuinely trying to avert AI doom, please take the time to form inside views away from memetic environments[6] which are likely to have been heavily influenced by commercial pressures. Then back-chain from a theory of change where the world is more often saved by your actions, rather than going with the current and picking a job with safety in its title as a way to try and do your part.

  1. ^

    Space Race—Wikipedia:

    It had its origins in the ballistic missile-based nuclear arms race between the two nations following World War II and had its peak with the more particular Moon Race to land on the Moon between the US moonshot and Soviet moonshot programs. The technological advantage demonstrated by spaceflight achievement was seen as necessary for national security and became part of the symbolism and ideology of the time.

  2. ^

    Andrew Critch:

    I hate that people think AI obedience techniques slow down the industry rather than speeding it up. ChatGPT could never have scaled to 100 million users so fast if it wasn’t helpful at all.

    Making AI serve humans right now is highly profit-aligned and accelerant.

    Of course, later when robots could be deployed to sustain an entirely non-human economy of producers and consumers, there will be many ways to profit — as measured in money, materials, compute, energy, intelligence, or all of the above — without serving any humans. But today, getting AI to do what humans want is the fastest way to grow the industry.

  3. ^

    These paradigms do not seem to be addressing the most fatal filter in our future: strongly coherent goal-directed agents forming with superhuman intelligence. These will predictably undergo a sharp left turn, and the soft/fuzzy alignment techniques which worked at lower power levels will fail simultaneously as the system reaches high enough competence to reflect on itself, its capabilities, and the guardrails we built.

    Interpretability work could plausibly help with weakly aligned, weakly superintelligent systems that do our alignment homework for the much more capable systems to come. But the effort going into this direction seems highly disproportionate to how promising it is, is not backed by plans to pivot to using these systems to do the quite different style of alignment research that’s needed, and generally lacks research closure to avert capabilities externalities.

  4. ^

    From the team that broke the quadratic attention bottleneck:

    Simpler sub-quadratic designs such as Hyena, informed by a set of simple guiding principles and evaluation on mechanistic interpretability benchmarks, may form the basis for efficient large models.

  5. ^

    Ask yourself: “Who will cite my work?”, not “Can I think of a story where my work is used for good things?”

    There is work in these fields which might be good for x-risk, but you need to figure out if what you’re doing is in that category to be good for the world.

  6. ^

    Humans are natural mimics: we copy the people who have visible signals of doing well, because those are the memes which are likely to be good for our genes, and genes direct where we go looking for memes.

    Wealth, high confidence that they’re doing something useful, being part of a growing coalition: all great signs of good memes, and all much more possessed by people in the interpretability/obedience kind of alignment than by the old-school “this is hard and we don’t know what we’re doing, but it’s going to involve a lot of careful philosophy and math” crowd.

    Unfortunately, this memetic selection is not particularly adaptive for trying to solve alignment.

Crossposted from LessWrong (149 points, 31 comments)