Optimistic Assumptions, Longterm Planning, and “Cope”
Eliezer Yudkowsky periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:
Saying “well, if this assumption doesn’t hold, we’re doomed, so we might as well assume it’s true.”
Or, worse: coming up with cope-y reasons to treat the assumption as not even questionable at all – just a pretty reasonable worldview.
Sometimes the questionable plan is “an alignment scheme, which Eliezer thinks avoids the hard part of the problem.” Sometimes it’s a sketchy reckless plan that’s probably going to blow up and make things worse.
Some people complain about Eliezer being a doomy Negative Nancy who’s overly pessimistic.
I had an interesting experience a few months ago when I ran some beta-tests of my Planmaking and Surprise Anticipation workshop, that I think are illustrative.
i. Slipping into a more Convenient World
I have an exercise where I have people play a puzzle game ("Baba is You"). Normally you can move around and interact with the world to experiment and learn things; in the exercise, instead, you need to make a complete plan for solving the level, aiming to get it right on your first try.
In the exercise, I have people write down the steps of their plan, and assign a probability to each step.
If there is a part of the puzzle-map that you aren't familiar with, you'll have to make guesses. I recommend making 2-3 guesses for how a new mechanic might work. (I don't recommend making a massive branching tree for every possible eventuality – for the sake of the exercise not taking forever, 2-3 branching-path plans is enough.)
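To get a feel for why writing down per-step probabilities matters, here's a minimal sketch of the arithmetic. (The steps and numbers are made up for illustration, and it assumes the steps are independent, which they usually aren't.)

```python
# Hypothetical plan for a Baba is You level: (step, estimated probability it works).
plan = [
    ("Push the extra 'WIN' text over to the flag", 0.9),
    ("My guess about how the unfamiliar mechanic works is right", 0.7),
    ("I can reach the flag without breaking 'BABA IS YOU'", 0.8),
]

# If the steps were independent, the plan only succeeds if every step works.
overall = 1.0
for step, p in plan:
    overall *= p

print(f"Estimated chance the whole plan works: {overall:.0%}")  # ~50%
```

Even a handful of "pretty likely" steps multiply down fast, which is part of why the exercise asks you to write the numbers down rather than eyeball the plan as a whole.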
Several months ago, I had three young-ish alignment researchers do this task (each session was a 1-1 with just me and them).
The participants varied in how experienced they were with Baba is You. Two of them were new to the game, completed the first couple levels without too much difficulty, and then got to a harder level. The third participant had played a bit of the game before, and started with a level near where they had left off.
Each of them looked at their level for a while and said “Well, this looks basically impossible… unless this [questionable assumption I came up with that I don’t really believe in] is true. I think that assumption is… 70% likely to be true.”
Then they went and executed their plan.
It failed. The questionable assumption was not true.
Then each of them said, again, “Okay, well, here’s a different sketchy assumption that I wouldn’t have thought was likely, except that if it’s not true, the level seems unsolvable.”
I asked “what’s your probability for that one being true?”
“70%”
“Okay. You ready to go ahead again?” I asked.
“Yep”, they said.
They tried again. The plan failed again.
And, then they did it a third time, still saying ~70%.
This happened with three different junior alignment researchers, making a total of 9 predictions, which were wrong 100% of the time.
(The third guy, on the second or third time, said “well… okay, I was wrong last time. So this time let’s say it’s… 60%.”)
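Just as arithmetic: if each of those nine “70% likely” assumptions had genuinely been 70% likely (and roughly independent), seeing all nine fail would have been staggeringly unlikely. A quick back-of-the-envelope check:

```python
# Chance that all 9 assumptions turn out false, if each really were 70% likely
# (treating them as independent, which is a simplification).
p_all_wrong = (1 - 0.7) ** 9
print(f"{p_all_wrong:.6f}")  # ≈ 0.00002, i.e. roughly 1 in 50,000
```

The far more plausible explanation is that “70%” was tracking something other than the actual odds.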
My girlfriend ran a similar exercise with another group of young smart people, with similar results. “I’m 90% sure this is going to work” … “okay that didn’t work.”
Later I ran the exercise again, this time with a mix of younger and more experienced AI safety folk, several of whom leaned more pessimistic. I think the group overall did better.
One of them actually made the correct plan on the first try.
One of them got it wrong, but gave an appropriately low estimate for themselves.
Another of them (call them Bob) made three attempts, and gave themselves ~50% odds on each attempt. They went into the experience thinking “I expect this to be hard but doable, and I believe in developing the skill of thinking ahead like this.”
But, after each attempt, Bob was surprised by how out-of-left-field their errors were. They’d predicted they’d be surprised… but they were surprised in surprising ways – even in a simplified, toy domain that was optimized for being a solvable puzzle, where they had lots of time to think everything through. They came away feeling a bit shaken up by the experience, not sure whether they believed in longterm planning at all, and a bit alarmed at how many people around them confidently talked as if they were able to model things multiple steps out.
ii. Finding traction in the wrong direction.
A related (though distinct) phenomenon I found in my own personal experiments using Baba Is You, Thinking Physics, and other puzzle exercises as rationality training:
It’s very easy to spend a lot of time optimizing within the areas I feel some traction, and then eventually realize this was wasted effort. A few different examples:
Forward Chaining instead of Back Chaining.
In Baba Is You levels, there will often be parts of the world that are easy to start fiddling around with – manipulating them and maneuvering them into positions that look like they’ll help you navigate the world. But, often, these parts are red herrings. They open up your option-space within the level… but not the parts you needed to win.
It’s often faster to find the ultimately right solution if you’re starting from the end and backchaining, rather than forward chaining with whatever bits are easiest to fiddle around with.
Moving linearly, when you needed to be exponential.
Often in games I’ll be making choices that improve my position locally, and clearly count as some degree of “progress.” I’ll get 10 extra units of production, or damage. But then I reach the next stage, and it turns out I really needed 100 extra units to survive. And the thought patterns that would have been necessary to “figure out how to get 100 units” on my first playthrough are pretty different from the ones I was actually using.
It should have occurred to me to ask “will the game ever throw a bigger spike in difficulty at me?”, and “is my current strategy of tinkering around going to prepare me for such difficulty?”.
Doing lots of traction-y-feeling reasoning that just didn’t work.
On my first Thinking Physics problem last year, I brainstormed multiple approaches to solving the problem, and tried each of them. I reflected on considerations I might have missed, and then incorporated them. I made models and estimations. It felt very productive and reasonable.
I got the wrong answer, though.
My study partner did get the right answer. Their method was more oriented around thought experiments, and in retrospect their approach seemed more useful for this sort of problem. It’s noteworthy that my subjective feeling of “making progress” didn’t actually correspond to making the sort of progress that mattered.
Takeaways
Obviously, an artificial puzzle is not the same as a real, longterm research project. Some differences include:
It’s designed to be solvable
But, also, it’s designed to be sort of counterintuitive and weird
It gives you a fairly constrained world, and tells you what sort of questions you’re trying to ask.
It gives you clear feedback when you’re done.
Those elements push in different directions. Puzzles are more deliberately counterintuitive than reality is, on average, so it’s not necessarily “fair” when you fall for a red herring. But they are nonetheless mostly easier and clearer than real science problems.
What I found most interesting was people literally saying the words out loud, multiple times “Well, if this [assumption] isn’t true, then this is impossible” (often explicitly adding “I wouldn’t [normally] think this was that likely… but...”). And, then making the mental leap all the way towards “70% that this assumption is true.” Low enough for some plausible deniability, high enough to justify giving their plan a reasonable likelihood of success.
It was a much clearer instance of mentally slipping sideways into a more convenient world than I’d have expected to get.
I don’t know if the original three people had done calibration training of any kind beforehand. I know my own experience doing the OpenPhil calibration game was that I got good at it within a couple hours… but that it didn’t transfer very well to when I started making PredictionBook / Fatebook questions about topics I actually cared about.
I expect forming hypotheses in a puzzle game to be harder than the OpenPhil Calibration game, but easier than making longterm research plans. It requires effort to wrangle your research plans into a bet-able form, and then actually make predictions about them. I bet most people do not do that.
Now, I do predict that people who do real research in a given field will get at least decent at implicitly predicting research directions within their field (via lots of trial-and-error, and learning from mentors). This is what “research taste” is. But, I don’t think this is that reliable if you’re not deliberately training your calibration. (I have decades of experience passively predicting stuff happening in my life, but I nonetheless was still miscalibrated when I first started making explicit PredictionBook predictions about them).
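For what it’s worth, the kind of explicit calibration check I mean is cheap to run once you’ve logged predictions somewhere (PredictionBook- or Fatebook-style). A rough sketch, assuming you’ve got your predictions as (stated probability, outcome) pairs – the example data here is invented:

```python
from collections import defaultdict

# Hypothetical logged predictions: (stated probability, whether it came true).
predictions = [(0.7, False), (0.7, False), (0.9, True), (0.6, True), (0.8, False), (0.5, True)]

# Bucket by stated confidence and compare against actual hit rates.
buckets = defaultdict(list)
for p, outcome in predictions:
    buckets[round(p, 1)].append(outcome)

for p in sorted(buckets):
    outcomes = buckets[p]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"Said {p:.0%}: right {hit_rate:.0%} of the time ({len(outcomes)} predictions)")

# A Brier score compresses the same information into one number
# (lower is better; always saying 50% scores 0.25).
brier = sum((p - outcome) ** 2 for p, outcome in predictions) / len(predictions)
print(f"Brier score: {brier:.2f}")
```

None of this tells you whether your hypotheses were any good – only whether the numbers you attach to them mean what you think they mean.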
And moreover, I don’t think this transfers much to new fields you haven’t yet mastered. Stereotypes come to mind of brilliant physicists who assume their spherical-cow-simplifications will help them model other fields.
This seems particularly important for existentially-relevant alignment research. We have examples of people who have demonstrated “some kind of traction and results” (for example, running experiments on modern ML systems, or, for that matter, coming up with interesting ideas like Logical Induction). But we don’t actually have direct evidence that this productivity will be relevant to superintelligent agentic AI.
When it comes to “what is good existential safety research taste?”, I think we are guessing.
I think you should be scared about this, if you’re the sort of theoretical researcher who’s trying to cut at the hardest parts of the alignment problem, where feedback loops are weak or nonexistent.
I think you should be scared about this, if you’re the sort of Prosaic ML researcher who does have a bunch of tempting feedback loops for current-generation ML, but a) it’s really not clear whether or how those apply to aligning superintelligent agents, and b) many of those feedback loops also basically translate into enhancing AI capabilities and moving us toward a more dangerous world.
I think you should be scared about this, if you’re working in policy, either as a research wonk or an advocate, where there are some levers of power you can sort-of-see, but how the levers fit together and whether they actually connect to longterm existential safety is unclear.
Unfortunately, “be scared” isn’t that useful advice. I don’t have a great prescription for what to do.
My dissatisfaction with this situation is what leads me to explore Feedbackloop-first Rationality, basically saying “Well the problem is our feedback loops suck – either they don’t exist, or they are temptingly goodharty. Let’s try to invent better ones.” But I haven’t yet achieved an outcome here I can point to and say “okay this clearly helps.”
But, meanwhile, my own best guess is:
I feel a lot more hopeful about researchers who have worked on a few different types of problems, and gotten more calibrated on where the edges of their intuitions’ usefulness are. I’m exploring the art of operationalizing cruxy predictions, because I hope that can eventually feed into the art of having calibrated, cross-domain research taste, if you are deliberately attempting to test your transfer learning.
I feel more hopeful about researchers who make lists of their foundational assumptions, and practice staring into the abyss – confronting “what would I actually do if my core assumptions were wrong, and my plan doesn’t work?”, and grieving for assumptions that seem, on reflection, to have been illusions.
I feel more hopeful about researchers who talk to mentors with different viewpoints, learning different bits of taste and hard-earned life lessons, and attempt to integrate them into some kind of holistic AI safety research taste.
And while I don’t think it’s necessarily right for everyone to set themselves the standard of “tackle the hardest steps in the alignment problem and solve it in one go”, I feel much more optimistic about people who have thought through “what are all the sorts of things that need to go right, for my research to actually pay off in an existential safety win?”
And I’m hopeful about people who look at all of this advice, think “well, this still doesn’t actually feel sufficient for me to be that confident my plans are really going to accomplish anything”, and set out to brainstorm new ways to shore up their chances.