Here are some thoughts I had while reading, with no particular coherent theme:
The way I see it, there are two kinds of gradient hacking possible. The first is a situation where the solution to the problem the model was trained to solve is an agent, a “mesa optimizer”, that has its own goals that are imperfectly aligned with the goals of the people who trained it and that rediscovers gradient hacking from first principles during its computation. […] The other way I see gradient hacking happening is if there are circuits in the model that simply resist being rewritten by gradient descent.
I think this distinction maps pretty cleanly onto a now-forgotten concept in AI alignment: the former is indeed a mesa-optimizer, while the latter maps onto optimization daemons. I think these should be given different names, maybe “full gradient hacker” and “internal gradient hacker”? A big difference is that a system could have multiple internal gradient hackers. Maybe it’s just a question of which level we’re looking at, and whether the hacker is short-/long-term beneficial/detrimental to itself/the supersystem?
Internal gradient hackers have been observed in systems other than neural networks, for example in Eurisko, where a heuristic assigned itself as the discoverer of other heuristics, resulting in a very high Worth. I don’t think we’ve seen anything like this in the context of neural networks, but I could imagine circuits copying themselves “backwards” through the network and mutating along the way. I guess the fact that there’s no recurrence (yet…) in advanced ML models is a big advantage.
Here’s the relevant passage:
One of the first heuristics that ᴇᴜʀɪꜱᴋᴏ synthesized (H59) quickly attained nearly the highest Worth possible (999). Quite excitedly, we examined it and could not understand at first what it was doing that was so terrific. We monitored it carefully, and finally realized how it worked: whenever a new conjecture was made with high worth, this rule put its own name down as one of the discoverers! It turned out to be particularly difficult to prevent this generic type of finessing of ᴇᴜʀɪꜱᴋᴏ’s evaluation mechanism. Since the rules had full access to ᴇᴜʀɪꜱᴋᴏ’s code, they would have access to any safeguards we might try to implement. We finally opted for having a small ‘meta-level’ of protected code that the rest of the system could not modify.
—Douglas B. Lenat, “ᴇᴜʀɪꜱᴋᴏ: A Program That Learns New Heuristics and Domain Concepts”, p. 30, 1983
There is no direct analogy to recombination in gradient descent.
I’m not sure this is completely true, though I have to think a bit more about it. There are techniques like dropout that make training more robust; in the context of an internal gradient hacker, dropout would probably change parts of the hacker while leaving other parts untouched, which would make reliable internal communication much more difficult. I guess it would also provide an incentive for an internal gradient hacker to “evolve” internal redundancy & modularity, which we don’t want.
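To make the dropout point concrete, here is a minimal sketch (my own, using PyTorch; nothing here is from the post) of the property I have in mind: during training a different random subset of activations is zeroed on every forward pass, so a circuit spread across those units cannot count on all of its parts being active at the same time, while at inference time the masking disappears entirely.

```python
# Toy illustration (assumes PyTorch; not from the post) of why dropout is hostile
# to a fixed internal circuit: each training-time forward pass zeroes a different
# random subset of activations, so a circuit spanning those units cannot rely on
# all of its parts being active simultaneously.
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.randn(1, 8)

layer.train()                           # dropout active during training
mask_a = layer(x) == 0                  # units zeroed on this pass (ReLU + dropout)
mask_b = layer(x) == 0                  # units zeroed on the next pass
print(torch.equal(mask_a, mask_b))      # almost certainly False: different masks

layer.eval()                            # dropout disabled at inference time
print(torch.equal(layer(x), layer(x)))  # True: output is deterministic
```

This is also, I think, exactly the pressure that would push a hacker towards redundancy: surviving random masking is what redundancy buys you.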
I also know that people have observed that swapping layers of neural networks doesn’t have a very large effect; I don’t think this is used as a training technique but it could be.
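I have not seen a standard recipe for this, but here is a crude sketch (entirely my construction, again in PyTorch) of what the measurement could look like: apply a stack of residual blocks in a permuted order and check how much the output moves.

```python
# Crude sketch (my construction, not an established technique) of measuring how much
# swapping two adjacent blocks in a residual stack perturbs the output. One could
# imagine applying such swaps during training as a recombination-like perturbation.
import torch
import torch.nn as nn

torch.manual_seed(0)

blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(6)])

def forward(x, order):
    # Apply the residual blocks in the given order.
    for i in order:
        x = x + torch.relu(blocks[i](x))
    return x

x = torch.randn(4, 16)
base = forward(x, [0, 1, 2, 3, 4, 5])
swapped = forward(x, [0, 1, 3, 2, 4, 5])   # swap blocks 2 and 3

print(((base - swapped).norm() / base.norm()).item())  # relative change in output
```

(With untrained random weights this number does not mean much; the interesting question is how small it stays in a trained network, and whether training with such swaps would break up an internal hacker the way recombination breaks up linked genes.)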
Paternal/maternal genome exclusion. This is a real thing that can happen where one parent’s genetic material is either silenced or rejected entirely at an early stage of development. It can lead to parthenogenesis. The short-term advantage of this is that the included parent’s genes are 100% represented in each offspring. The longterm disadvantage is having mutations accumulate.
I knew it! I’ve been wondering about this for literally years, thanks for confirming that this is a thing that happens.
The examples of gradient hackers with positive effects seem like they could be following the pattern of “here’s a sub-system doing something bad (e.g. transposons copying themselves incessantly), which the system needs to defend against, so the system finds a way to defend (e.g. introns) that carries other (maybe greater) benefits but wouldn’t have been found otherwise”. Does that seem like it explains things?
Yes, this is broadly accurate from my knowledge of positive examples (for the organism) of drive. They either contribute more scratch (TEs) or they drive through a nifty innovation (homing endonucleases for mating-type switching in yeast, V(D)J recombination in immune cells) that can be co-opted. It’s possible there are other positive contributions that we don’t know about, of course.
The coolest example is Cupressus dupreziana, the androgenetic cypress. It’s hard to observe a history of extinctions from meiotic drive, because it’s not a cause of death that fossilizes, but this one we’re seeing just before it completes. When I learned about this, there were only 28 individuals left in this species. Genome exclusion is covered in chapter 10 of Burt & Trivers.
Re: analogies to recombination, as I was preparing these old notes to post I did think that possibly the cost function, or the task being trained on, could be seen as somewhat analogous, in the sense that they are sort of templates against which performance is being checked. It’s a very tenuous thought and I can’t quite make the analogy work, but maybe you or someone else can do something with it.
Awesome post. Loved it.