Safety Researcher and Scalable Alignment Team lead at DeepMind. AGI will probably be wonderful; let’s make that even more probable.
Geoffrey Irving
Is Holden still on the board?
I’m the author of the cited AI safety needs social scientists article (along with Amanda Askell), previously at OpenAI and now at DeepMind. I currently work with social scientists in several different areas (governance, ethics, psychology, …), and would be happy to answer questions (though expect delays in replies).
I lead some of DeepMind’s technical AGI safety work, and wanted to add two supporting notes:
I’m super happy we’re growing strategy and governance efforts!
We view strategy and governance questions as coupled to technical safety, and are working to build very close links between research in the two areas so that governance mechanisms and alignment mechanisms can be co-designed. (This also applies to technical safety and the Ethics Team, among other teams.)
This paper has at least two significant flaws when used to estimate the relative complexity of biological and artificial neurons for practical purposes. In the authors’ defense, such an estimate wasn’t the main motivation of the paper, but the Quanta article is all about estimation and the paper doesn’t mention the flaws.
Flaw one: no reversed control
Say we have two parameterized model classes A and B, and ask what values of n are necessary for A_n to approximate B_1 and for B_n to approximate A_1. It is trivial to construct model classes for which n is large in both directions, just because A is a much better way to approximate A than B is, and vice versa. I’m not sure how much this cuts off the 1000 estimate, but it could easily be 10x.

Brief Twitter thread about this: https://twitter.com/geoffreyirving/status/1433487270779174918
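To illustrate with a toy construction of my own (not from the paper), take truncated Fourier series versus piecewise-constant functions:

```latex
A_n = \Big\{\, x \mapsto \textstyle\sum_{k=1}^{n} c_k \sin(kx) \,\Big\},
\qquad
B_n = \big\{\, \text{piecewise-constant functions on } [0,2\pi] \text{ with at most } n \text{ jumps} \,\big\}.
```

Approximating a single step function (an element of B_1) within A_n, and approximating sin(x) (an element of A_1) within B_n, both require n to grow without bound as the target accuracy improves. The measured factor is large in both directions because it reflects the mismatch between the two classes, not the intrinsic complexity of either element.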
Flaw two: no scaling w.r.t. multiple neurons
I don’t see any reason to believe the 1000 factor would remain constant as you add more neurons, i.e., when we’re approximating many real neurons with (many more) artificial neurons at once. In particular, it’s easy to construct model classes where the factor decays to 1 as you add more real neurons (see the toy cost model below). I don’t know how strong this effect is, but again there is no discussion or estimation of it in the paper.
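As a toy cost model of my own (purely illustrative, not from the paper): suppose emulating real neurons requires an expensive shared feature bank plus a cheap per-neuron readout,

```latex
\mathrm{cost}(k \text{ real neurons}) = C_{\mathrm{shared}} + k\,c,
\qquad
\frac{\mathrm{cost}(k)}{k} = \frac{C_{\mathrm{shared}}}{k} + c \;\longrightarrow\; c
\quad \text{as } k \to \infty.
```

With C_shared = 999 and c = 1 this reproduces the headline factor of 1000 for a single neuron, while the per-neuron factor decays to 1 at scale, which is why a single-neuron measurement by itself doesn’t pin down network-level overhead.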
Ah, I see: you’re going to lean on the difference between “cause” and “control”. So to be clear: I am claiming that, as an empirical matter, we also can’t control the past, or even “control” the past.
To expand, I’m not using physics priors to argue that physics is causal, so we can’t control the past. I’m using physics and history priors to argue that we exist in the non-prediction case relative to the past, so CDT applies.
By “physics-based” I’m lumping together physics and history a bit, but it’s hard to disentangle them, especially when people start talking about multiverses. I generally mean “the combined information of the laws of physics and our knowledge of the past”. The reason I do want to cite physics too, even for the past case of (1), is that if you somehow disagreed about decision theorists in WW1 I’d go to the next part of the argument, which is that under the technology of WW1 one couldn’t do the necessary predictive control (they couldn’t build deterministic twins back then).
However, it seems like we’re mostly in agreement, and you could consider editing the post to make that clearer. The opening line of your post is “I think that you can “control” events you have no causal interaction with, including events in the past.” Now the claim is “everyone agrees about the relevant physics — and in particular, that you can’t causally influence the past”. These two sentences seem inconsistent, and since your piece is long and quite technical, opening with a wrong summary may confuse people.
I realize you can get out of the inconsistency by leaning on the quotes, but it still seems misleading.
As a high-level comment, it seems bad to structure the world so that the smartest people compete against each other in zero-sum games. It’s definitely the case that zero-sum games are the best way to ensure technical hardness, as the games will by construction be right at the threshold of playability. But if we do this we’re throwing most of the value away in comparison to working on positive-sum games.
Unfortunately, this is unlikely to be an effective use of resources (speaking as someone who has worked in high-performance computing for the past 18 years). The resources you can contribute will be dwarfed by the volume and efficiency of cloud services and supercomputers. Even then, due to network constraints, the only feasible tasks will be embarrassingly parallel computations that do not stress network or memory, and very few scientific computing tasks have this form.
So certainly physics-based priors are a big component, and indeed in some sense are all of it. That is, I think physics-based priors should give you an immediate answer of “you can’t influence the past with high probability”, and moreover that once you think through the problems in detail the conclusion will be that you could influence the past if physics were different (including boundary conditions, even if the laws remain the same), but that boundary condition priors should still tell us you can’t influence the past. I’m happy to elaborate.
First, I think saying CDT is wrong, full stop, is much less useful than saying that CDT has a limited domain of applicability (using Sean Carroll’s terminology from The Big Picture). Analogously, one shouldn’t say that Newtonian physics is wrong, but that it has a limited domain of applicability, and one should be careful to apply it only in that domain. Of course, you can choose to stick to the “wrong” terminology; the claim is only that this is less useful.
So what’s the domain of applicability of CDT? Roughly, I think the domain is cases where the agent can’t be predicted by other agents in the world. I personally like to call this the “free will” case, but that’s my personal definition, so if you don’t like that definition we can call it the non-prediction case. The deterministic twin case violates this, as there is a dimension of decision making where non-prediction fails: each twin can perfectly predict the other’s actions conditional on their own actions. So deterministic twins are outside the domain of applicability of CDT.
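To make the twin case concrete, here is the standard payoff sketch (the numbers are mine and purely illustrative):

```latex
\text{Payoffs } (u_{\text{you}}, u_{\text{twin}}):\quad
(C,C) \to (2,2), \quad (C,D) \to (0,3), \quad (D,C) \to (3,0), \quad (D,D) \to (1,1).
```

Holding the twin’s action fixed, defection dominates (3 > 2 and 1 > 0), which is CDT’s recommendation; but because the twins are deterministic copies, only the mirrored outcomes (C,C) and (D,D) are reachable, and u(C,C) = 2 > 1 = u(D,D). The “hold the other agent’s action fixed” step is exactly the step that fails, which is what I mean by the twins falling outside CDT’s domain.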
A consequence of this view is that whether we are in or out of the domain of applicability of CDT is an empirical question: you can’t resolve it from pure theory. I further claim (without pinning down the definitions very well) that “generic, un-tuned” situations fall into the non-prediction case. This is again an empirical claim, and roughly says that “something needs to happen” to be outside the non-prediction case. In the deterministic twin case, this “something” is the intentional construction of the twins. Some detailed claims:
Humanity’s past fits the non-prediction case. For example, it is not the case that “perhaps you and some of the guards are implementing vaguely similar decision procedures at some level” in World War 1, not least because most of decision theory was invented after World War 1. Again, this is a purely empirical claim: it could have been otherwise, and I’m claiming it wasn’t.
The multiverse fits the non-prediction case. I also believe that once we have a sufficient understanding of cosmology, we will conclude that it is most likely that the multiverse fits the non-prediction case, roughly because the causal linkages behind the multiverse (through quantum branching, inflation, or logical possibilities) are high temperature in some sense. This is again an empirical prediction about cosmology, though of course it’s much harder to check and I’m much less confident in it than for (1).
The world does not entirely fall into the non-prediction case. As an example, it is perilous when advertisers have too much information and computation asymmetry with users, since that asymmetry can break non-prediction (more here). A consequence of this is that it’s good that people are studying decision theories with larger domains of applicability.
AGI safety v1 can likely be made to fall into the non-prediction case. This is another highly contingent claim, and requires some action to ensure, namely somehow telling AGIs to avoid breaking the non-prediction case in appropriate senses (and designing them so that this is possible to do). (I expect to get jumped on for this one, but before you believe I’m just ignorant it might be worth asking Paul whether I’m just ignorant.) And I do mean v1; it’s quite possible that v2 goes better if we have the option of not telling them this.
I do want to emphasize that as a consequence of (3), (4), uncertainty about (2), and a way tinier amount of uncertainty about (1), I’m happy people are exploring this space. But of course I’m also going to place a lower estimate on its importance as a consequence of the above.
I’m not sure anyone else is going to be brave enough to state this directly, so I’ll do it:
After reading some of this post (and talking to Paul a bunch and Scott a little), I remain unconfused about whether we can control the past.
I think we might just end up in the disaster scenario where you get a bunch of karma. :)
If we want to include a hits-based approach to careers, but also respect people not having EA goals as their exclusive life goal, I’d worry that signing this pledge is incompatible with staying in a career that the EA community subsequently decides is ineffective. This could be true even if, given the information known at the time of the career choice, the career looked like terrific expected value.
The actual wording of the pledge seems okay under this metric, as it only promises to “seek out ways to increase the impact of my career”, so maybe this is fine as long as the pledge doesn’t rise to “switch career” in all cases.
Won’t this comment get hidden soon?
As someone who’s worked both in ML for formal verification with security motivations in mind, and (now) directly on AGI alignment, I think most EA-aligned folk who would be good at formal verification will be close enough to being good at direct AGI alignment that it will be higher impact to work directly on AGI alignment. It’s possible this would change in the future if there are a lot more people working on theoretically-motivated prosaic AGI alignment, but I don’t think we’re there yet.
I think that isn’t the right counterfactual, since I got into EA circles despite having only minimal (and net negative) impressions of EA-related forums. So your claim is narrowly true, but if instead the counterfactual were that my first exposure to EA was the EA Forum, then yes, I think the prominence of this kind of post would have made me substantially less likely to engage.
But fundamentally if we’re running either of these counterfactuals I think we’re already leaving a bunch of value on the table, as expressed by EricHerboso’s post about false dilemmas.
I bounce off posts like this. Not sure if you’d consider me net positive or not. :)
Not a non-profit, but since you mention AI and X-risk it’s worth mentioning DeepMind, since program managers are core to how research is organized and led here: https://deepmind.com/careers/jobs/2390893.
5% probability by 2039 seems way too confident that it will take a long time: is this intended to be a calibrated estimate, or does the number have a different meaning?
Yep, that’s the right interpretation.
In terms of hardware, I don’t know how Chrome did it, but at least on fully capable hardware (mobile CPUs and above) you can often bitslice to make almost any circuit efficient if it has to be evaluated in parallel. So my prior is that quite general things don’t need new hardware if one is sufficiently motivated, and I would want to see the detailed reasoning before believing you can’t do it with existing machines.
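For anyone unfamiliar with the trick, here’s a minimal bitslicing sketch in C (my own illustration, not tied to any particular workload): each bit position of a machine word holds an independent instance of the circuit, so one pass of bitwise operations evaluates 64 instances at once.

```c
#include <stdint.h>
#include <stdio.h>

/* Bitsliced 1-bit full adder: each of the 64 bit positions of a, b, cin
 * carries an independent instance of the circuit, so a single call
 * evaluates 64 adders in parallel using only bitwise operations. */
static void full_adder_64(uint64_t a, uint64_t b, uint64_t cin,
                          uint64_t *sum, uint64_t *cout) {
    uint64_t t = a ^ b;
    *sum  = t ^ cin;              /* sum  = a XOR b XOR cin     */
    *cout = (a & b) | (t & cin);  /* cout = majority(a, b, cin) */
}

int main(void) {
    /* Instance i lives in bit i: instance 0 computes 1+1+0,
     * instance 1 computes 1+0+1, instance 2 computes 0+0+1, ... */
    uint64_t a = 0x3, b = 0x1, cin = 0x6, sum, cout;
    full_adder_64(a, b, cin, &sum, &cout);
    printf("sum = %llx, cout = %llx\n",
           (unsigned long long)sum, (unsigned long long)cout);
    return 0;
}
```

The same pattern extends to any fixed boolean circuit: given enough independent instances to evaluate, the per-instance cost is roughly the gate count divided by the word (or SIMD register) width, which is why general-purpose hardware goes further than people expect.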
Unfortunately, a significant part of the situation is that people with internal experience and a negative impression feel both constrained and conflicted (in the conflict-of-interest sense) about making public statements. This applies to me: I left OpenAI in 2019 for DeepMind (thus the conflicted).