This is a really interesting post, and I appreciate how clearly it is laid out. Thank you for sharing it! But I’m not sure I agree with it, particularly the way that everything is pinned to the imminent arrival of AGI.
Firstly, the two assumptions you spell out in your introduction, that AGI is likely only a few years away and that it will most likely come from scaled up and refined versions of modern LLMs, are both much more controversial than you suggest (I think)! (Although I’m not confident they are false either.)
But even if we accept those assumptions, the third big assumption here is that we can alter a superintelligent AGI’s values in a predictable and straightforward way just by adding some synthetic training data which expresses the views we like when building some of its component LLMs. This seems like a strange idea to me!
If we removed some concept from the training data completely, or introduced a new concept that had never appeared otherwise, then I can imagine that having some impact on the AGI’s behaviour. But if all kinds of content are already included in significant quantities, then I find it hard to get my head around additional carefully chosen synthetic data having this kind of effect. I guess it clashes with my understanding of what a superintelligent AGI means to think that its behaviour could be altered by such simple manipulation.
I think an important aspect of this is that even if AGI does come from scaling up and refining LLMs, it is not going to be just an LLM in the straightforward sense of that term (i.e. something that communicates by generating each word with a single forward pass through a neural network). At the very least it must also have some sort of hidden internal monologue where it does chain-of-thought reasoning, stores memories, and so on.
But I don’t know much about AI alignment, so I would be very interested to read and understand more about the reasoning behind this third assumption.
All that said, even ignoring AGI, LLMs are likely going to be used more and more in people’s everyday lives over the next few years, so training them to express kinder views towards animals seems like a potentially worthwhile goal anyway. I don’t think AGI needs to come into it!
Thanks for your response. You’re right that imminent AGI from AI similar to LLMs is controversial, and I should have spelled that out more explicitly. And I agree it wouldn’t be a pure LLM, but my understanding is that the advances people talk about, like o1-style reasoning, wouldn’t significantly alter the influence of the pre-training data.
My intuition is that LLMs (especially base models) work as simulators, outputting whatever seems like the most likely completion. But what seems most likely can only come from the training data. So if we include a lot of pro-animal data (and especially data from animal perspectives), then the LLM is more likely to ‘believe’ that the most likely completion is one which supports animals. E.g. base models are already much more likely to complete text mentioning murder from the perspective that murder is bad, because almost all of their pretraining data treats murder as bad. While it might seem that this is inherently dumb behavior and incompatible with AGI (much less ASI), I think humans work mostly the same way: we like the food and music we grew up with, we mostly internalize the values and factual beliefs we see most often in our society, and the more niche some values or factual beliefs are, the less willing we are to take them seriously. So going from e.g. 0.0001% data from animal perspectives to 0.1% would be a 1000x increase, and would hopefully greatly decrease the chance that astronomical animal suffering is ignored even if the cost to stop it would be small (but non-zero).
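To make the scale of that concrete, here’s a rough back-of-the-envelope sketch of the arithmetic. The corpus size below is a made-up round number purely for illustration; only the 0.0001% and 0.1% figures come from my example above.

```python
# Rough sketch: how much synthetic pro-animal data would be needed to move its
# share of a pretraining corpus from 0.0001% to 0.1%. The corpus size is a
# hypothetical round number, not a claim about any real training run.

def synthetic_tokens_needed(corpus_tokens: float,
                            current_fraction: float,
                            target_fraction: float) -> float:
    """Tokens of synthetic data to add so the topic reaches target_fraction.

    Solves (current + x) / (corpus_tokens + x) = target_fraction for x.
    """
    current = corpus_tokens * current_fraction
    return (target_fraction * corpus_tokens - current) / (1 - target_fraction)

corpus = 15e12                   # hypothetical 15-trillion-token pretraining mix
current_frac = 0.0001 / 100      # 0.0001% of tokens from animal perspectives
target_frac = 0.1 / 100          # target share: 0.1%

extra = synthetic_tokens_needed(corpus, current_frac, target_frac)
print(f"Increase factor: {target_frac / current_frac:.0f}x")    # 1000x
print(f"Synthetic tokens to add: {extra / 1e9:.1f} billion")     # ~15.0 billion
```

So on these assumptions the change is large relative to the existing animal-perspective data, but still a tiny fraction of the overall mix, which is why I don’t expect it to noticeably degrade the model elsewhere.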