I read that AI-generated text is being used as input data due to a data shortage. What do you think are some foreseeable implications of this?
You may be referring to Stanford’s Alpaca? That project took Meta’s LLaMA – an LLM pre-trained on a huge corpus of text, from web scrapes up to higher-quality sources like Wikipedia and books – and fine-tuned it on ~52,000 instruction-following examples generated with OpenAI’s text-davinci-003, in order to make it more helpful as a chatbot. So the AI-generated data there was only used for a small part of the training, as a final step. (Pre-training is the initial, and I think by far the longest, training phase, where LLMs learn next-token prediction on those big text corpora.)
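To make that parenthetical concrete, here’s a toy sketch of the next-token-prediction objective in PyTorch. The tiny model and the random “tokens” are stand-ins I made up for illustration – nothing here resembles the actual LLaMA/Alpaca setup – but the loss is the same idea, and supervised fine-tuning reuses it on a much smaller, conversational dataset.

```python
# Toy sketch of next-token prediction: predict token t+1 from tokens up to t.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                   # logits: (batch, seq_len, vocab_size)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 16))    # pretend these are tokenized text
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one position

logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()
```

A real LLM swaps the GRU for a large transformer and the random tokens for trillions of real ones, but the training objective is the same.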
SOTA models like GPT-4 are pre-trained this way on huge amounts of text, with the higher-quality sources typically favored or upweighted. (They’re then typically turned into chatbots via fine-tuning on conversational data and/or reinforcement learning from human feedback.) Most internet text is lower quality (think Reddit), so there’s plenty more of it to use, but of course it makes for worse training data than the curated sources. Epoch estimates – with large error bars – that we’ll run out of high-quality text data around 2024 and of all internet text data around 2040.
I think ML engineers haven’t really hit a hard data bottleneck yet, so there hasn’t been that much activity around using synthetic data (i.e. data that’s been machine-generated, either with an AI or in some other way). Lots of people, myself included, expect labs to start experimenting more with this as they run out of high-quality data. I also think compute and willingness to spend are, and will remain, more important bottlenecks to AI progress than data, but I’m not sure about that.
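If it’s useful, here’s a rough, hypothetical sketch of what that kind of experimentation often looks like – an Alpaca/self-instruct-style loop that starts from a few seed examples, asks an existing model to write new ones, and keeps those that pass a basic filter. The `call_model` function and the seed tasks below are placeholders I invented, not any lab’s actual pipeline.

```python
# Hypothetical sketch: bootstrapping synthetic instruction data from seed examples.
import json
import random

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (API or local checkpoint)."""
    raise NotImplementedError

# Made-up seed tasks, just to show the shape of the data.
SEED_TASKS = [
    {"instruction": "Summarize the paragraph below in one sentence.",
     "input": "<paragraph>", "output": "<summary>"},
    {"instruction": "Translate this sentence into French.",
     "input": "<sentence>", "output": "<translation>"},
]

def make_prompt(examples):
    shots = "\n\n".join(json.dumps(e) for e in examples)
    return ("Here are examples of instruction-following data as JSON objects:\n\n"
            f"{shots}\n\n"
            "Write one new, different example in the same JSON format.")

def generate_synthetic(n_examples: int):
    corpus = list(SEED_TASKS)
    while len(corpus) < n_examples:
        candidate = call_model(make_prompt(random.sample(corpus, k=2)))
        try:
            record = json.loads(candidate)
        except json.JSONDecodeError:
            continue  # crude quality filter: drop malformed generations
        if all(k in record for k in ("instruction", "input", "output")):
            corpus.append(record)  # accepted examples become future few-shot prompts
    return corpus
```

In practice, a lot of the work is in that filtering step (deduplication, quality and diversity checks), since low-quality synthetic data can end up degrading the model it’s meant to improve.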