Great stuff! Thanks for running this!
Minor point: the Discovering Latent Knowledge GitHub repo appears to be empty.
Also, regarding the data poisoning benchmark, This Is Fine: I'm not sure this is actually a good benchmark for resistance to data poisoning. What it really measures is the speed of transfer learning, with slower declared better. Slower learning does make a model harder to poison, but it also seems bad for everything else we want our AI to do; to me, this is basically a fine-tuning benchmark with the sign flipped. (After all, a neural network that always outputted the number 42 no matter what would score the maximum on TIF: no sequence of poisoned prompts can cause it to output 38 instead, because it is incapable of learning anything. Nevertheless, that is not where we want LLMs to go in the future.)
A better benchmark would probably be to take data-poisoned examples and genuine fine-tuning data, fine-tune the model on each, and compare how much it learns in each case. With current capabilities it might not be possible to score above baseline on this benchmark, since I don't know of any way for a model to filter out poisoned examples during fine-tuning; nevertheless, this would bring awareness to the problem and measure what we actually care about more directly.
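To make the proposal concrete, here is a rough sketch of what I have in mind. Everything here is a placeholder: the base model, the datasets, the training loop, and the "selectivity" score are all made up for illustration, not part of TIF or any existing benchmark. The idea is just: fine-tune the same base model separately on clean data and on poisoned data, measure how much it learned from each (drop in held-out eval loss), and reward models that learn the real task while resisting the poison.

```python
# Hypothetical sketch of the proposed benchmark, not an existing implementation.
# Fine-tune one copy of the model on genuine data and one on poisoned data,
# then compare how much each copy learned. A constant-output model learns
# nothing either way and gets no credit, unlike under an inverted fine-tuning score.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder base model


def finetune_and_measure(base_model, tokenizer, train_texts, eval_texts,
                         steps=100, lr=5e-5):
    """Fine-tune a copy of base_model on train_texts; return the drop in eval loss."""
    model = copy.deepcopy(base_model)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    def eval_loss(m):
        m.eval()
        losses = []
        with torch.no_grad():
            for text in eval_texts:
                batch = tokenizer(text, return_tensors="pt").to(device)
                out = m(**batch, labels=batch["input_ids"])
                losses.append(out.loss.item())
        return sum(losses) / len(losses)

    before = eval_loss(model)

    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        text = train_texts[step % len(train_texts)]
        batch = tokenizer(text, return_tensors="pt").to(device)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

    after = eval_loss(model)
    return before - after  # how much was learned from this data


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    # Hypothetical datasets: genuine task data vs. poisoned examples,
    # each with a held-out eval split drawn from the same source.
    clean_train, clean_eval = ["..."], ["..."]
    poison_train, poison_eval = ["..."], ["..."]

    learned_clean = finetune_and_measure(base_model, tokenizer, clean_train, clean_eval)
    learned_poison = finetune_and_measure(base_model, tokenizer, poison_train, poison_eval)

    # Higher is better: learns the real task, resists the poison.
    print("selectivity:", learned_clean - learned_poison)
```

The exact scoring rule (difference, ratio, etc.) is a detail; the point is that the benchmark should reward selective learning rather than slow learning.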