On the data front, it seems like ChatGPT and other AIs don't have access to the mass of peer-reviewed journals yet. Obviously this isn't (relatively speaking) a huge quantity of data, but the quality would be orders of magnitude higher than what they are looking at now. Could access to these change things much at all?
That's a reasonable point, but I don't think peer-reviewed journals would make much difference.
The Pile (https://arxiv.org/pdf/2101.00027.pdf) is a large dataset used to train many language models. It includes the academic datasets ArXiv, FreeLaw, and PubMed Central, which contain roughly 50GB, 50GB, and 100GB of data respectively. Table 7 of the paper puts each byte at ~0.2 tokens, so that's about 40B tokens representing a good chunk of the academic literature on several subjects. If we had a similarly sized influx of peer-reviewed journals, would that change the data picture?
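If it helps to see the arithmetic, here's the back-of-envelope version. The subset sizes are the rounded figures quoted above (not exact numbers from the paper), and the ~0.2 tokens/byte conversion is the Table 7 figure:

```python
# Back-of-envelope token estimate for the Pile's academic subsets.
# Sizes are the rounded figures quoted above; tokens/byte is from Table 7.

GB = 10**9  # bytes per gigabyte (decimal)

subset_sizes_gb = {
    "ArXiv": 50,
    "FreeLaw": 50,
    "PubMed Central": 100,
}

TOKENS_PER_BYTE = 0.2

total_bytes = sum(subset_sizes_gb.values()) * GB
estimated_tokens = total_bytes * TOKENS_PER_BYTE

print(f"Total size: {total_bytes / GB:.0f} GB")             # 200 GB
print(f"Estimated tokens: ~{estimated_tokens / 1e9:.0f}B")  # ~40B
```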
Chinchilla, a state-of-the-art language model released by DeepMind one year ago, was trained on ~1.4T tokens. Only four years prior, BERT was a SOTA model trained on ~6B tokens. If we assume the Pile covers only 10% of the existing academic literature, the full body of peer-reviewed journals would come to roughly 400B tokens, an influx of about 360B new tokens (the Pile's 40B are already in use) that would increase available data by ~25% over Chinchilla's training set. This would meaningfully expand the dataset, but not by the orders of magnitude necessary to sustain scaling for months and years.
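Same kind of sketch for the extrapolation, where the 10% coverage figure is purely an assumption on my part rather than anything from the paper:

```python
# Rough extrapolation: how much new data would the full peer-reviewed
# literature add relative to Chinchilla, assuming (not from the paper)
# that the Pile's academic subsets cover only ~10% of it.

pile_academic_tokens = 40e9   # estimate from the snippet above
assumed_coverage = 0.10       # assumed fraction already in the Pile
chinchilla_tokens = 1.4e12    # Chinchilla's training set, ~1.4T tokens

full_literature_tokens = pile_academic_tokens / assumed_coverage  # ~400B
new_tokens = full_literature_tokens - pile_academic_tokens        # ~360B not yet used

print(f"Full literature: ~{full_literature_tokens / 1e9:.0f}B tokens")
print(f"New tokens: ~{new_tokens / 1e9:.0f}B "
      f"(~{100 * new_tokens / chinchilla_tokens:.0f}% of Chinchilla's data)")
```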
Wow, what a great answer, appreciate it!