It costs essentially nothing to run a gradient-based optimizer on a neural network, and I think this is sufficient for good-enough alignment. I view the interpretability work I do at Eleuther as icing on the cake, allowing us to steer models even more effectively than we already can. Yes, it’s not zero cost, but it’s dramatically lower cost than it would be if we had to crack open a skull and do neurosurgery.
Also, if by “mechanistic interpretability” you mean “circuits”, I’m honestly pretty pessimistic about the usefulness of that kind of research, and I think the really useful stuff is lower cost than circuits-based interp.
If you want to say “it’s a black box, but the box has a ‘gradient’ output channel in addition to the ‘next-token-probability-distribution’ output channel”, then I have no objection.
If you want to say “...and those two output channels are sufficient for safe & beneficial AGI”, then you can say that too, although I happen to disagree.
If you want to say “we also have interpretability techniques on top of those, and they work well enough to ensure alignment for both current and future AIs”, then I’m open-minded and interested in details.
If you want to say “we can’t understand how a trained model does what it does in any detail, but if we had to drill into a skull and only measure a few neurons at a time etc. then things sure would be even worse!!”, then yeah duh.
But your OP said “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost”, and used the term “white box”. That’s the part that strikes me as crazy. To be charitable, I don’t think those words are communicating the message that you had intended to communicate.
For example, find a random software engineer on the street, and ask them: “If I give you a 1-terabyte compiled executable binary, and you can do whatever you want with that file on your home computer, would you describe it as closer to ‘white box’ or ‘black box’?” I predict most people would say “closer to black box”, even though they can look at all the bits and step through the execution and run decompilation tools etc. if they want. Likewise you can ask them whether it’s possible to “analyze” that binary “at essentially no cost”. I predict most people would say “no”.
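To make the “two output channels” framing from earlier in this comment concrete, here is a minimal sketch in plain Python. It assumes a toy softmax model over a 3-token vocabulary; the function names, weights, and context vector are all made up for illustration and do not come from any real LLM or library.

```python
import math

def next_token_probs(weights, context):
    """Channel 1: softmax over logits, where logits[i] = dot(weights[i], context)."""
    logits = [sum(w * x for w, x in zip(row, context)) for row in weights]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_gradient(weights, context, target):
    """Channel 2: gradient of -log p[target] with respect to every weight.

    For a softmax layer this has the closed form (p[i] - 1[i == target]) * x[j].
    """
    probs = next_token_probs(weights, context)
    return [[(p - (1.0 if i == target else 0.0)) * x for x in context]
            for i, p in enumerate(probs)]

# Hypothetical toy setup: 3-token vocabulary, 2-dimensional context vector.
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
context = [1.0, 2.0]

probs = next_token_probs(weights, context)       # what the box predicts
grad = loss_gradient(weights, context, target=0) # how to nudge it toward token 0
```

Both channels are cheap to read off, which is the uncontroversial part. Whether reading them off counts as “analyzing” the model is exactly what is in dispute here; the sketch does not settle that.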
Differentiability is a pretty big part of the white box argument.
The terabyte compiled executable binary is still white box in a minimal sense, but it’s going to take a lot of work to mould that thing into something that does what you want. You’ll have to decompile it and do a lot of static analysis, and Rice’s theorem gets in the way of the kinds of stuff you can prove about it. The code might be adversarially obfuscated, although literal black-box obfuscation is provably impossible in general.
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff. And if I’m worried about the base model being preserved underneath and doing nefarious things, I can generate synthetic data from the fine tuned model and train a fresh network from scratch on that (although to be fair that’s pretty compute-intensive).
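As a rough sketch of the fine-tune-then-retrain-from-scratch recipe described above, here is a toy version in plain Python. Everything in it is an assumption for illustration: the “network” is a two-parameter linear model, the “pretrained” weights and task data are invented, and `sgd_fit` is a hypothetical helper, not a library function.

```python
import random

def predict(params, x):
    w, b = params
    return w * x + b

def sgd_fit(params, data, lr=0.05, epochs=200):
    """Full-batch gradient descent on mean squared error; returns the new (w, b)."""
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Step 1: fine-tune hypothetical "pretrained" weights on a new task (y = 3x + 1).
pretrained = (1.0, 0.0)
task_data = [(x, 3 * x + 1) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
fine_tuned = sgd_fit(pretrained, task_data)

# Step 2: label fresh inputs with the fine-tuned model to get synthetic data.
inputs = [rng.uniform(-2, 2) for _ in range(50)]
synthetic = [(x, predict(fine_tuned, x)) for x in inputs]

# Step 3: train a new model from scratch, using only the synthetic data.
student = sgd_fit((0.0, 0.0), synthetic)
```

In this linear toy the student simply recovers the fine-tuned function. Whether the same recipe strips out unwanted behavior in a real network is a separate empirical question that this sketch does not address, and the compute cost at real scale is the caveat already noted above.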
I don’t think “mouldability” is a synonym of “white-boxiness”. In fact, I think they’re hardly related at all:
There can be a black box with lots of knobs on the outside that change the box’s behavior. It’s still a black box.
Conversely, consider an old-fashioned bimetallic strip thermostat with a broken dial. It’s not mouldable at all: it can do one and only one thing, i.e. actuate a switch at a certain fixed temperature. (Well, I guess you can use it as a doorstop!) But a bimetallic strip thermostat is still very white-boxy (after I spend 30 seconds telling you how it works).
You wrote “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.” I feel like I keep pressing you on this, and you keep motte-and-bailey’ing into some other claim that does not align with a common-sense reading of what you originally wrote:
“Well, the cost of analysis could theoretically be even higher—like, if you had to drill into skulls…” OK sure but that’s not the same as “essentially no cost”.
“Well, the cost of analysis may be astronomically high, but there’s a theorem proving that it’s not theoretically impossible…” OK sure but that’s not the same as “essentially no cost”.
“Well, I can list out some specific analysis and manipulation tasks that we can do at essentially no cost: we can do X, and Y, and Z, …” OK sure but that’s not the same as “we can analyze and manipulate however we want at essentially no cost”.
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff.
But this is irrelevant to the original claim, right? Being able to fine-tune might make introspection on the network’s internal algorithmic representations a bit cheaper, but in practice we observe that it takes weeks or months of alignment researchers’ time to figure out what extremely tiny slices of two-generations-old LLMs are doing.
Do you see what I mean?