Differentiability is a pretty big part of the white box argument.
The terabyte compiled executable binary is still white box in a minimal sense but it’s going to take a lot of work to mould that thing into something that does what you want. You’ll have to decompile it and do a lot of static analysis, and Rice’s theorem gets in the way of the kinds of stuff you can prove about it. The code might be adversarially obfuscated, although literal black box obfuscation is provably impossible.
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff. And if I’m worried about the base model being preserved underneath and doing nefarious things, I can generate synthetic data from the fine tuned model and train a fresh network from scratch on that (although to be fair that’s pretty compute-intensive).
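(For concreteness, the fine-tune-then-distill move described there could look something like the toy sketch below. The tiny models, layer sizes, and synthetic training target are purely illustrative assumptions on my part, not anything from the actual systems under discussion.)

```python
# Minimal sketch (hypothetical toy models, not a real LLM pipeline) of the
# procedure above: fine-tune a pretrained network, then train a fresh network
# from scratch purely on the fine-tuned model's outputs, so none of the base
# model's weights are carried over directly.
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

base_model = make_net()  # stand-in for the trillion-weight base model

# 1. Fine-tune a copy of the base model toward the behaviour we want
#    (here, an arbitrary toy regression target).
finetuned = make_net()
finetuned.load_state_dict(base_model.state_dict())
opt = torch.optim.Adam(finetuned.parameters(), lr=1e-3)
for _ in range(200):
    x = torch.randn(32, 16)
    loss = nn.functional.mse_loss(finetuned(x), torch.sin(x))
    opt.zero_grad(); loss.backward(); opt.step()

# 2. Generate synthetic data from the fine-tuned model.
with torch.no_grad():
    xs = torch.randn(4096, 16)
    ys = finetuned(xs)

# 3. Train a fresh, randomly initialised network on that synthetic data.
fresh = make_net()
opt = torch.optim.Adam(fresh.parameters(), lr=1e-3)
for _ in range(500):
    idx = torch.randint(0, len(xs), (32,))
    loss = nn.functional.mse_loss(fresh(xs[idx]), ys[idx])
    opt.zero_grad(); loss.backward(); opt.step()
```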
I don’t think “mouldability” is a synonym of “white-boxiness”. In fact, I think they’re hardly related at all:
There can be a black box with lots of knobs on the outside that change the box’s behavior. It’s still a black box.
Conversely, consider an old-fashioned bimetallic strip thermostat with a broken dial. It's not mouldable at all—it can do one and only one thing, i.e. actuate a switch at a certain fixed temperature. (Well, I guess you can use it as a doorstop!) But a bimetallic strip thermostat is still very white-boxy (after I spend 30 seconds telling you how it works).
You wrote “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.” I feel like I keep pressing you on this, and you keep motte-and-bailey’ing into some other claim that does not align with a common-sense reading of what you originally wrote:
“Well, the cost of analysis could theoretically be even higher—like, if you had to drill into skulls…” OK sure but that’s not the same as “essentially no cost”.
“Well, the cost of analysis may be astronomically high, but there’s a theorem proving that it’s not theoretically impossible…” OK sure but that’s not the same as “essentially no cost”.
“Well, I can list out some specific analysis and manipulation tasks that we can do at essentially no cost: we can do X, and Y, and Z, …” OK sure but that’s not the same as “we can analyze and manipulate however we want at essentially no cost”.

Do you see what I mean?
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff.
But this is irrelevant to the original claim, right? Being able to fine-tune might make introspection on its internal algorithmic representations a bit cheaper, but in practice we observe that it takes alignment researchers weeks or months to figure out what extremely tiny slices of two-generations-old LLMs are doing.