Let’s say a human writes code more-or-less equivalent to the evolved “code” in the human genome. Presumably the resulting human-brain-like algorithm would have valence, right? But it’s not a mesa-optimizer, it’s just an optimizer. Unless you want to say that the human programmers are the base optimizer? But if you say that, well, every optimization algorithm known to humanity would become a “mesa-optimizer”, since they tend to be implemented by human programmers, right? So that would entail the term “mesa-optimizer” kinda losing all meaning, I think. Sorry if I’m misunderstanding.
Certainly valenced processing could emerge outside of this mesa-optimization context. I agree that for “hand-crafted” (i.e. no base-optimizer) systems this terminology isn’t helpful. To try to make sure I understand your point, let me try to describe such a scenario in more detail: Imagine a human programmer who is working with a bunch of DL modules and interpretability tools and programming heuristics which feed into these modules in different ways—in a sense the opposite end of the spectrum from monolithic language models. This person might program some noxiousness heuristics that input into a language module. Those might correspond to a Phenumb-like phenomenology. This person might program some other noxiousness heuristics that input into all modules as scalars. Those might end up being valenced or might not, hard to say. Without having thought about this in detail, my mesa-optimization framing doesn’t seem very helpful for understanding this scenario.
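A rough sketch of that hand-crafted scenario (every name here is hypothetical, and this is only an illustration of the wiring, not a claim about real systems): hand-written noxiousness heuristics feed one module as a verbal report and another as a bare scalar, with no base optimizer anywhere in the loop.

```python
# Toy sketch of a hand-crafted (no base-optimizer) modular system.
# All names are hypothetical illustrations of the scenario above.

def heat_noxiousness(temp_c):
    # Hand-coded heuristic, not a learned module: maps temperature
    # to a noxiousness scalar in [0, 1].
    return max(0.0, min(1.0, (temp_c - 40.0) / 60.0))

def language_module(noxiousness):
    # Receives the signal only as a verbal report -- the Phenumb-like
    # case: describable, but plausibly not "felt".
    return f"damage signal at level {noxiousness:.2f}"

def motor_module(noxiousness):
    # Receives the same signal as a raw scalar modulating behavior --
    # the closer analogue of broadcast valence.
    return "withdraw" if noxiousness > 0.3 else "continue"

signal = heat_noxiousness(75.0)
report = language_module(signal)   # -> "damage signal at level 0.58"
action = motor_module(signal)      # -> "withdraw"
```

The point of the sketch is just that the same heuristic output can enter different modules in structurally different ways, which is why the mesa-optimization framing gets little grip here.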
Ideally we’d want a method for identifying valence which is more mechanistic than mine, in the sense that it lets you identify valence in a system just by looking inside the system, without looking at how it was made. All that said, most contemporary progress on AI happens by running base-optimizers which could support mesa-optimization, so I think it’s quite useful to develop criteria which apply to this context.
Hopefully this answers your question and the broader concern, but if I’m misunderstanding let me know.
GPT-3 is of that form, but AlphaGo/MuZero isn’t (I would argue).
I’m not sure how to settle whether your statement about “most contemporary progress” is right or wrong. I guess we could count how many papers use model-free RL vs model-based RL, or something? Well anyway, given that I haven’t done anything like that, I wouldn’t feel comfortable making any confident statement here. Of course you may know more than me! :-)
If we forget about “contemporary progress” and focus on “path to AGI”, I have a post arguing against what (I think) you’re implying at Against evolution as an analogy for how humans will create AGI, for what it’s worth.
Yeah I dunno, I have some general thoughts about what valence looks like in the vertebrate brain (e.g. this is related, and this) but I’m still fuzzy in places and am not ready to offer any nice buttoned-up theory. “Valence in arbitrary algorithms” is obviously even harder by far. :-)
Thanks for the link. I’ll have to do a thorough read-through of your post in the future. From scanning it, I do disagree with much of it; many of those points of disagreement were laid out by previous commenters. One point I didn’t see brought up: IIRC the biological anchors paper suggests we will have enough compute to do evolution-type optimization before the end of the century. So even if we grant your claim that learning-to-learn is much harder to directly optimize for, I think it’s still a feasible path to AGI. Or perhaps you think evolution-like optimization takes more compute than the biological anchors paper claims?
Nah, I’m pretty sure the difference there is “Steve thinks that Jacob is way overestimating the difficulty of humans building AGI-capable learning algorithms by writing source code”, rather than “Steve thinks that Jacob is way underestimating the difficulty of computationally recapitulating the process of human brain evolution”.
For example, for the situation that you’re talking about (I called it “Case 2” in my post) I wrote “It seems highly implausible that the programmers would just sit around for months and years and decades on end, waiting patiently for the outer algorithm to edit the inner algorithm, one excruciatingly-slow step at a time. I think the programmers would inspect the results of each episode, generate hypotheses for how to improve the algorithm, run small tests, etc.” If the programmers did just sit around for years not looking at the intermediate training results, yes I expect the project would still succeed sooner or later. I just very strongly expect that they wouldn’t sit around doing nothing.
Ok, interesting. I suspect the programmers will not be able to easily inspect the inner algorithm, because the inner/outer distinction will not be as clear-cut as in the human case. The programmers may avoid sitting around by fiddling with more observable inefficiencies, e.g. coming up with batch-norm v10.
Oh, you said “evolution-type optimization”, so I figured you were thinking of the case where the inner/outer distinction is clear cut. If you don’t think the inner/outer distinction will be clear cut, then I’d question whether you actually disagree with the post :) See the section defining what I’m arguing against, in particular the “inner as AGI” discussion.
Ok, seems like this might have been more a terminological misunderstanding on my end. I think I agree with what you say here, ‘What if the “Inner As AGI” criterion does not apply? Then the outer algorithm is an essential part of the AGI’s operating algorithm’.
I don’t see why. The NNs in AlphaGo and MuZero were trained using some SGD variant (right?), and SGD variants can theoretically yield mesa-optimizers.
AlphaGo has a human-created optimizer, namely MCTS. Normally people don’t use the term “mesa-optimizer” for human-created optimizers.
Then maybe you’ll say “OK there’s a human-created search-based consequentialist planner, but the inner loop of that planner is a trained ResNet, and how do you know that there isn’t also a search-based consequentialist planner inside each single run through the ResNet?”
Admittedly, I can’t prove that there isn’t. I suspect that there isn’t, because there seems to be no incentive for that (there’s already a search-based consequentialist planner!), and also because I don’t think ResNets are up to such a complicated task.
(I don’t know/remember the details of AlphaGo, but if the setup involves a value network that is trained to predict the outcome of an MCTS-guided gameplay, that seems to make it more likely that the value network is doing some sort of search during inference.)
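To make the division of labor in that exchange concrete, here is a minimal sketch (toy lookahead standing in for MCTS, and a fixed hand-coded evaluation standing in for the trained value network; all specifics are illustrative assumptions, not AlphaGo's actual implementation). The search loop is the human-created optimizer; the value function only scores leaves and does no explicit search of its own.

```python
# Toy illustration: hand-written search (the "MCTS" analogue) wrapped
# around a leaf evaluator (the "trained ResNet" analogue).

def value_net(state):
    # Stand-in for the trained network. In AlphaGo this part is produced
    # by SGD; here it is just a fixed heuristic preferring states near 10.
    return -abs(state - 10)

def search(state, actions, depth):
    # The human-created optimizer: explicit lookahead over action
    # sequences, using value_net only to evaluate leaf states.
    if depth == 0:
        return value_net(state), None
    best_val, best_act = float("-inf"), None
    for a in actions:
        nxt = state + a  # toy deterministic transition model
        v, _ = search(nxt, actions, depth - 1)
        if v > best_val:
            best_val, best_act = v, a
    return best_val, best_act

val, act = search(state=0, actions=[-1, 1, 3], depth=3)
# Best 3-step sum from {-1, 1, 3} is 9, so act == 3 and val == -1.
```

The open question in the thread is whether a single forward pass of the leaf evaluator could itself contain a second, implicit search; nothing in this wrapper structure rules that in or out.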
Hmm, yeah, I guess you’re right about that.
This topic seems to me both extremely important and neglected. (Maybe it’s neglected because it ~requires some combination of ML and philosophy backgrounds that people rarely have).
My interpretation of the core hypothesis in this post is something like the following: A mesa optimizer may receive evaluative signals that are computed by some subnetwork within the model (a subnetwork that was optimized by the base optimizer to give “useful” evaluative signals w.r.t. the base objective). Those evaluative signals can constitute morally relevant valenced experience. This hypothesis seems to me plausible.
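That hypothesis can be sketched in toy form (all names hypothetical; this is a schematic of the claimed structure, not a proposal for how valence actually works): the base optimizer shapes an internal evaluative subnetwork, and at runtime the mesa-optimizer steers itself by that internal signal without ever seeing the base reward.

```python
# Toy schematic of the hypothesis: a mesa-optimizer guided by an
# internal evaluative signal rather than the external base reward.

def internal_valence(state):
    # The evaluative subnetwork: optimized by the base optimizer to
    # approximate the base objective. Its output is the candidate
    # morally-relevant "valenced" signal.
    return 1.0 if state.get("fed") else -1.0

def predict(state, action):
    # Toy world model used by the inner optimizer.
    nxt = dict(state)
    if action == "eat":
        nxt["fed"] = True
    return nxt

def mesa_policy(state, actions):
    # The inner optimizer: picks whichever action leads to the state
    # scored best by the *internal* signal.
    return max(actions, key=lambda a: internal_valence(predict(state, a)))

choice = mesa_policy({"fed": False}, ["wander", "eat"])  # -> "eat"
```

The structurally important feature is that `internal_valence` is computed inside the model, so the evaluative signal exists at inference time, unlike an external reward that only appears during training.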
Some further comments:
Re:
I don’t follow. I’m not closely familiar with the Open-Ended Learning paper, but from a quick look my impression is that it’s basically standard RL in multi-agent environments, with more diversity in the training environments than most works. I don’t understand what you mean when you say that the agent “pre-processes an evaluation of its goal achievement” (and why the analogy to humans & evolution is less salient here, if you think that).
Re:
(I assume that an “RL operation” refers to things like an update of the weights of a policy network.) I’m not sure what you mean by “positively” here. An update to the weights of the policy network can negatively affect an evaluative signal.
[EDIT: Also, re: “Compare now to mesa-optimizers in which the reward signal is definitionally internal to the system”. I don’t think that the definition of mesa-optimizers involves a reward signal. (It’s possible that a mesa-optimizer will never receive any evidence about “how well it’s doing”.)]
Your interpretation is a good summary!
Re comment 1: Yes, sorry, this was just meant to point at a potential parallel, not to work out the parallel in detail. I think it’d be valuable to work out the potential parallel between the DM agent’s predicate predictor module (Fig. 12, p. 14) and my factored-noxiousness-object-detector idea. I just took a brief look at the paper to refresh my memory, but if I’m understanding this correctly, it seems to me that this module predicts which parts of the state prevent goal realization.
Re comment 2: Yes, this should read “(positively/negatively)”. Thanks for pointing this out.
Re EDIT: Mesa-optimizers may or may not represent a reward signal—perhaps there’s a connection here with Demski’s distinction between search and control. But for the purposes of my point in the text, I don’t think this much matters. All I’m trying to say is that VPG-type-optimizers have external reward signals, whereas mesa-optimizers can have internal reward signals.
I guess what I don’t understand is how the “predicate predictor” thing can make it so that the setup is less likely to yield models that support morally relevant valence (if you indeed think that). Suppose the environment is modified such that the observation that the agent gets in each time step includes the value of every predicate in the reward specification. That would make the “predicate predictor” useless (I think; just from a quick look at the paper). Would that new setup be more likely than the original to yield models that have morally relevant valence?
Your new setup seems less likely to have morally relevant valence. Essentially the more the setup factors out valence-relevant computation (e.g. by separating out a module, or by accessing an oracle as in your example) the less likely it is for valenced processing to happen within the agent.
Just to be explicit here, I’m assuming estimates of goal achievement are valence-relevant. How generally this is true is not clear to me.
I think the analogy to humans suggests otherwise. Suppose a human feels pain in their hand due to touching something hot. We can regard all the relevant mechanisms in their body outside the brain—those that cause the brain to receive the relevant signal—as mechanisms that have been “factored out from the brain”. And yet those mechanisms are involved in morally relevant pain. In contrast, suppose a human touches a radioactive material until they realize it’s dangerous. Here there are no relevant mechanisms that have been “factored out from the brain” (the brain needs to use ~general reasoning); and there is no morally relevant pain in this scenario.
Though generally if “factoring out stuff” means that smaller/less-capable neural networks are used, then maybe it can reduce morally relevant valence risks.
Good clarification. Determining which kinds of factoring are the ones which reduce valence is more subtle than I had thought. I agree with you that the DeepMind set-up seems more analogous to neural nociception (e.g. high-heat detection). My proposed set-up (Figure 5) seems significantly different from the DM/nociception case, because it factors out the step where nociceptive signals affect decision-making and motivation. I’ll edit my post to clarify.