If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff.
But this is irrelevant to the original claim, right? Being able to fine-tune might make introspection on its interal algorithmic representations a bit cheaper, but in practice we observe that it takes us weeks or months of alignment researchers’ time to figure out what extremely tiny slices of two-generations-old LLMs are doing.
But this is irrelevant to the original claim, right? Being able to fine-tune might make introspection on its interal algorithmic representations a bit cheaper, but in practice we observe that it takes us weeks or months of alignment researchers’ time to figure out what extremely tiny slices of two-generations-old LLMs are doing.