You may be referring to Stanford's Alpaca? That project took an LLM by Meta that was pre-trained on structured data (think Wikipedia, books), and fine-tuned it using ChatGPT-generated conversations in order to make it more helpful as a chatbot. So the AI-generated data there was only used for a small part of the training, as a final step. (Pre-training is the initial, and I think by far the longest, training phase, where LLMs learn next-token prediction using structured data like Wikipedia.)
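To make the pre-training vs. fine-tuning distinction concrete, here's a rough sketch of that Alpaca-style supervised fine-tuning step. Everything in it is a stand-in: gpt2 instead of Meta's model, and two made-up instruction/response pairs instead of the tens of thousands of model-generated examples Alpaca actually used. The point is just that fine-tuning uses the same next-token-prediction objective as pre-training, only on a much smaller, conversation-shaped dataset.

```python
# Rough sketch of Alpaca-style supervised fine-tuning on AI-generated
# instruction data. gpt2 and the two examples below are placeholders,
# not the real base model or the real Alpaca dataset.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical AI-generated (instruction, response) pairs standing in for
# the ChatGPT-style conversations used in the fine-tuning step.
examples = [
    {"instruction": "Explain what pre-training is.",
     "response": "Pre-training is the initial phase where a model learns "
                 "next-token prediction on a large text corpus."},
    {"instruction": "Name one source of high-quality text data.",
     "response": "Wikipedia is a common example."},
]

def format_example(ex):
    # Same objective as pre-training (next-token prediction), just applied
    # to instruction-shaped text instead of raw web/books data.
    text = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['response']}"
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(examples).map(
    format_example, remove_columns=["instruction", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-sketch", num_train_epochs=1,
                           per_device_train_batch_size=1, report_to="none"),
    train_dataset=dataset,
    # mlm=False gives the causal (next-token) language-modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```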
SOTA models like GPT-4 are all pre-trained on structured data. (They're then typically turned into chatbots using fine-tuning on conversational data and/or reinforcement learning from human feedback.) The internet is mostly unstructured data (think Reddit), so there's plenty more of that to use, but of course unstructured data is worse quality than structured data. Epoch estimates (with large error bars) that we'll run out of structured ("high-quality") text data ~2024 and all internet text data ~2040.
I think ML engineers haven't really hit any data bottleneck yet, so there hasn't been that much activity around using synthetic data (i.e. data that's been machine-generated, either with an AI or in some other way). Lots of people, myself included, expect labs to start experimenting more with this as they start running out of high-quality structured data. I also think compute and willingness to spend are and will remain more important bottlenecks to AI progress than data, but I'm not sure about that.
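For what it's worth, the most basic version of synthetic data generation is just prompting an existing model and keeping its outputs as new training examples (roughly the self-instruct recipe behind Alpaca). A hedged sketch, using a small local model and two made-up seed prompts purely for illustration:

```python
# Minimal sketch of machine-generating training data with an existing model.
# gpt2 and these seed prompts are illustrative stand-ins only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

seed_prompts = [
    "Write a short explanation of photosynthesis:",
    "Give an example of a polite email declining a meeting:",
]

synthetic_examples = []
for prompt in seed_prompts:
    out = generator(prompt, max_new_tokens=64, do_sample=True, temperature=0.8)
    # Each (prompt, completion) pair becomes a machine-generated training example.
    synthetic_examples.append({"prompt": prompt,
                               "completion": out[0]["generated_text"]})

print(synthetic_examples)
```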