Orthogonality is Expensive
@beren discusses the assumption that intelligent systems would be well factored into a world model, objectives/values, and a planning system.
He highlights that this factorisation doesn't describe intelligent agents created by ML systems (e.g. model-free RL) well. Model-free RL agents don't have cleanly factored architectures but tend to learn value functions/policies directly from the reward signal.
Such systems are much less general than their fully model-based counterparts: a policy that is optimal under one reward function may perform very poorly under another.
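To make the contrast concrete, here is a minimal toy sketch (my construction, not from Beren's post): a factored agent searches over an explicit world model with a pluggable reward function, while a "cached" model-free policy was fit to one particular reward and no longer consults any reward at all. The chain environment, the hard-coded policy, and both reward functions are invented purely for illustration.

```python
# Toy illustration (not from the post): a factored agent re-plans when the
# reward changes, while a cached model-free policy keeps acting as if the old
# reward were still in force.

# World model: deterministic transitions on a 4-state chain 0-1-2-3.
TRANSITIONS = {
    (s, a): max(0, min(3, s + a))       # action a in {-1, +1} moves along the chain
    for s in range(4) for a in (-1, +1)
}

def plan(state, reward_fn, depth=3):
    """Factored agent: brute-force lookahead over the world model for the best
    first action under whatever reward function is plugged in right now."""
    def value(s, d):
        if d == 0:
            return 0.0
        return max(
            reward_fn(TRANSITIONS[(s, a)]) + value(TRANSITIONS[(s, a)], d - 1)
            for a in (-1, +1)
        )
    return max(
        (-1, +1),
        key=lambda a: reward_fn(TRANSITIONS[(state, a)])
        + value(TRANSITIONS[(state, a)], depth - 1),
    )

# Model-free agent: a policy "trained" (here simply hard-coded) for reward_right,
# amortising the search into a lookup that never references any reward.
cached_policy = {0: +1, 1: +1, 2: +1, 3: +1}

reward_right = lambda s: float(s == 3)   # original objective: reach state 3
reward_left  = lambda s: float(s == 0)   # new objective: reach state 0

print(plan(1, reward_right), cached_policy[1])  # both move right: 1 1
print(plan(1, reward_left),  cached_policy[1])  # planner adapts (-1); cache does not (1)
```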
Yet contemporary ML favours such systems over their well-factored counterparts because they are much more efficient:
- Inference costs can be paid up front by learning a function approximator of the optimal policy and then amortised over the agent's lifetime (see the cost sketch after this list)
- A single inference step is just a forward pass through the function approximator in a non-factored system, versus a search through a solution space to find the optimal plan/strategy in a well-factored system
- The agent doesn't need to learn features of the environment that aren't relevant to its reward function
- The agent can exploit the structure of the underlying problem domain
- Specific recurring patterns can be better amortised
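The amortisation point can be made quantitative with a back-of-the-envelope sketch. Exhaustive lookahead pays roughly b^d world-model calls per decision (branching factor b, horizon d), while a learned policy pays one forward pass regardless of horizon. The branching factors and depths below are arbitrary illustrative values, not taken from the post.

```python
# Rough per-decision cost comparison (illustrative numbers only): explicit
# search over a world model vs a single forward pass through an amortised policy.

def search_cost(branching_factor, depth):
    """Number of world-model evaluations for exhaustive depth-d lookahead."""
    return sum(branching_factor ** k for k in range(1, depth + 1))

def amortised_cost():
    """A learned policy answers in one forward pass, whatever the horizon."""
    return 1

for b, d in [(2, 5), (4, 10), (10, 10)]:
    print(f"b={b}, d={d}: search={search_cost(b, d):,} model calls, "
          f"amortised={amortised_cost()} forward pass")
```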
Beren attributes this tradeoff between specificity and generality to no free lunch theorems.
Attaining full generality is prohibitively expensive; as such, full orthogonality is not the default or ideal case but merely one end of a Pareto tradeoff curve, with different architectures occupying various positions along it.
The future of AGI systems will be shaped by the slope of this Pareto frontier across the range of general capabilities, which determines whether we see fully general AGI singletons, multiple general systems, or a large number of highly specialised systems.
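One way to make the Pareto framing precise, in my own notation rather than Beren's: score each architecture A on generality g(A) and efficiency e(A), and take the frontier to be the set of architectures not dominated on both axes, with "full orthogonality" sitting at the maximal-generality extreme.

```latex
% My notation, not Beren's: \mathcal{A} = space of architectures,
% g(A) = generality, e(A) = efficiency. The Pareto frontier:
\[
  \mathcal{P} = \bigl\{ A \in \mathcal{A} \;:\; \nexists\, A' \in \mathcal{A}
  \text{ s.t. } g(A') \ge g(A),\ e(A') \ge e(A),\
  \bigl(g(A'), e(A')\bigr) \ne \bigl(g(A), e(A)\bigr) \bigr\}
\]
```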
This may or may not be an "orthogonality thesis" (I haven't seen this usage before, but I also haven't looked for it), but the orthogonality thesis I'm familiar with has the quantifiers the other way around: for any goal, there exists a possible AGI that will attempt to carry it out.[1] Even if mixing-and-matching a dumb industrial control system with a smart friendly AI is safe, that doesn't mean that a smart industrial control system arrived at through some other route won't paperclip you.
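The quantifier contrast can be written out explicitly. This is my formalisation, not the commenter's notation: take G to be the set of goals, S the set of possible AGI systems, and pursues(s, g) to mean "s competently pursues g".

```latex
% The orthogonality thesis as the commenter states it (an existence claim):
\[
  \forall g \in G \;\; \exists s \in S : \operatorname{pursues}(s, g)
\]
% The swapped-quantifier reading (closer to the "well-factored, plug in any
% values" picture) would instead assert a single system that can be paired
% with an arbitrary goal:
\[
  \exists s \in S \;\; \forall g \in G : \operatorname{pursues}(s, g)
\]
```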
Though in practice what people making likely-doom arguments really mean is that a generic human-created AI will not have recognizably human values, which is a somewhat stronger claim.
I think the way the orthogonality thesis is typically used in arguments might be closer to his definition than to yours.
Your definition is trivially true: all it requires is that an AGI having a specified goal is not physically impossible. But that doesn't prove that all goals are equally likely to occur, or even that AGI will have "goals" at all.
The way I see it deployed in practice is to say that a "dumb" AI will have some silly goal like "build squiggles", will go through an intelligence scale-up, and will keep that goal in hyper-intelligent form (and then pursuing that goal will result in disaster).
This argument doesn't necessarily work if goals and intelligence at tasks are highly correlated, as they currently are for deep learning systems. It may be that, in practical terms, scaling up in intelligence requires at least partially giving up on your initial goals. Or, conversely, that only AIs with certain types of goals will ever succeed at scaling themselves up in intelligence.
Yes, of course (hence the footnote).
My reading of the doomer view (which I don't necessarily endorse) is quite different: a dumb AI starts with some useful goal, goes through an intelligence scale-up that slightly perturbs its goal in some direction, and, because goals compatible with human life are a tiny thread winding their way through a stupidly high-dimensional manifold of all possible goals, ends up misaligned by default.
This doesn't especially hinge on whether these perturbations can be in any direction or only a few (as is the case if goals are strongly constrained by architecture), except in the case where they run only along the human-survival curve. Any transverse component whatsoever means you get pushed off-manifold almost always. And this is plausible (I think) only in the case where human values are not a tiny golden thread but actually rather large and fuzzily full-dimensional.
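To see why "any transverse component pushes you off-manifold almost always", here is a toy numerical sketch of my own construction (not the commenter's): treat the human-compatible goals as a low-dimensional subspace of a high-dimensional goal space and measure how much of a random perturbation lies inside it. The subspace dimension, ambient dimensions, and trial counts are arbitrary.

```python
# Illustrative numerics only: the fraction of a random unit perturbation that
# lies in a fixed k-dimensional "aligned" subspace of an n-dimensional goal
# space shrinks like k/n, so almost every perturbation is mostly transverse.

import numpy as np

rng = np.random.default_rng(0)

def in_subspace_fraction(n, k, trials=200):
    """Average squared-norm fraction of a random unit perturbation that lies
    in the first k coordinates (a stand-in for the 'aligned' directions)."""
    perturbations = rng.normal(size=(trials, n))
    perturbations /= np.linalg.norm(perturbations, axis=1, keepdims=True)
    return float(np.mean(np.sum(perturbations[:, :k] ** 2, axis=1)))

for n in (10, 1_000, 10_000):
    print(f"n={n:>6}: aligned fraction ~ {in_subspace_fraction(n, k=3):.5f}")
```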
I think there are different variations of the doomer argument out there; your version is probably the strongest, while mine is more common in introductory texts.
I think the OP does point out one possible way the argument could fail: if there turned out to be a sufficiently high correlation between human-aligned values and AI performance. One plausible mechanism would be a very slow takeoff in which the AI is not deceptive and is deleted if it tries to do misaligned things, creating evolutionary pressure towards friendliness.
Really though, my main objections to the doomerists concern other points. I simply do not believe that "misalignment = death". As an example, a suicidal AI that developed the urge to shut itself down at all costs would be misaligned but not fatal to humanity.