List #3: Why not to assume on priors that AGI-alignment workarounds are available


A friend in technical AI Safety shared a list of cruxes for their next career step.

One premise implicit in their writing was that a long-term safe AGI can be built in the first place (or that we should resolve to build safe AGI, since if AGI is inevitable then there is almost nothing else we can do but resolve to make it safe).


Copy-pasting a list I wrote in response (with light edits):

Why long-term AGI safety might be an unsolvable problem:

  1. Most specifiable problems are unsolvable:
    On priors, most engineering problems we can specify (as combinations of desiderata) are impossible to solve.

  2. Survivorship bias of solvable problems:
    There are of course higher-level engineering principles for picking what’s solvable, but in practice a lot of engineering looks like tinkering and failing (most inventions seem to have come out of such a process).

    The engineering problems that we hear about later are mostly ones that turned out to be possible (or that people have kept on working to solve). So there is a survivorship bias: most of the problems engineers hear about appear to be solvable. On top of that, there is a motivated focus on remembering the problems that turned out to be possible after all (try googling “impossible problems” – I was surprised).

  3. Outside the reference class:
    Would we presume that one of the most complex, intractable and dangerous engineering problems we could think of – long-term AGI safety – would in fact be solvable?

  4. Neglected research/​portfolio diversification:
    Few AI safety researchers, though, have raised the question of whether any dynamic described by an AGI threat model falls outside a theoretical limit of controllability, i.e. whether a given unsafe dynamic is uncontrollable. New research often follows this sequence: specify a novel threat model, jump to trying to solve it, and then refute solutions that turned out unsound (done either by multiple people, or all within one person).

    So given that the majority of the (short, non-reasoned-through) claims I have read from AIS researchers on this crucial consideration are that technical AGI alignment/safety is possible, in what direction should we expect researchers’ beliefs to move on average if they open-mindedly and rigorously researched this consideration?

  5. Unreliable intuitions:
    Nor can we rely on the confidently voiced intuition that long-term AGI safety is possible in principle.

    That would raise the question: under what principle? (Or what do people even mean by ‘in principle’?)

  6. Founder effects and arguments from authority:
    Nor can we take comfort in any founder researcher still saying that they know AGI safety to be possible.

    Thirty years into his program to secure the foundations of mathematics, David Hilbert declared: “We must know. We will know!” By then, Kurt Gödel had already constructed his first incompleteness theorem. Still, the quote ended up on Hilbert’s gravestone.

  7. History of young men resolving to do the impossible:
    Nor can we rely on the hope that if we try long enough (another 14 years?), maybe AGI safety turns out possible after all.

    Historically, researchers and engineers tried over decades, if not millennia, to solve impossible problems:

    1. perpetual motion machines, which would have to conserve their energy while continually dispersing it.

    2. compass-and-straightedge constructions for ‘squaring the circle’, ‘doubling the cube’ or ‘trisecting the angle’.

    3. formal axiomatic systems that are consistent, complete and decidable.

    4. distributed data stores that keep data consistent across nodes and continuously available, in a network that is also tolerant to partitions.

    5. local hidden-variable theories reproducing the predictions of quantum mechanics.

      ...until some bright outsider proved by contradiction that the combination of desiderata is unsatisfiable, given the (empirically validated) laws of physics or (formally verified) transformations of axioms.

  8. Resemblances of AGI to a perpetual motion machine:
    Forrest, the researcher I’m working with, referred to the ‘Aligned AGI’ idea as seeking to build a ‘Perpetual General Benefit Machine’.

    That is, the notion shared both by people involved in AGI R&D labs and by the AGI-alignment community that they can build a machine:

    1. that operates into perpetuity,

    2. self-learns internal code and self-modifies underlying hardware (i.e. initialising new internal code variants and connecting up new standardised parts for storing, processing and transmitting encoded information), and

    3. autonomously enacts changes modelled internally over domains across the external contexts of the global environment

      1. where all (changing) interactions of the machine’s (changing) internal components with connected surroundings of the (changing) environment...

      2. are aligned and kept in alignment, in how they function, with relevant metrics that are optimised toward the ‘benefit’ (and therefore also the (stochastically) guaranteed safety/survival) of all humans living everywhere for all time (i.e. not just for the original developers/researchers/executives/investors).

To clarify: I am not writing up *conclusive* reasoning for each alternative claim listed above. These are basic obvious reasons that might lead you to start questioning core beliefs underlying the notion that working on understanding neural networks is a way forward.

Crossposted from LessWrong (4 points, 1 comment)