That’s a reasonable point, but I don’t think peer-reviewed journals would make much difference.
The Pile (https://arxiv.org/pdf/2101.00027.pdf) is a large dataset used to train many language models. It includes the academic subsets ArXiv, FreeLaw, and PubMed Central, which contain roughly 50GB, 50GB, and 100GB of data respectively. Table 7 gives a ratio of ~0.2 tokens per byte, so that’s about 40B tokens representing a good chunk of the academic literature in several fields. If we had a similarly sized influx of peer-reviewed journals, would that change the data picture?
Chinchilla, a state-of-the-art language model released by DeepMind one year ago, was trained on ~1.4T tokens. Just four years earlier, BERT was a SOTA model trained on only ~6B tokens. If we assume the Pile includes only 10% of the existing academic literature, then peer-reviewed journals could represent a ~400B-token influx, which would increase the available data by roughly 30% over Chinchilla’s training set. That would meaningfully expand the dataset, but not by the orders of magnitude necessary to sustain scaling for months and years.
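For concreteness, here’s the back-of-envelope arithmetic as a quick Python sketch. The Pile component sizes are rounded, and the 10% coverage figure is (as above) just an assumption:

```python
# Back-of-envelope estimate of academic-text token counts, using the Pile's
# reported component sizes and its ~0.2 tokens-per-byte ratio (Table 7).

GB = 1e9  # bytes per gigabyte

# Approximate sizes of the Pile's academic subsets: ArXiv, FreeLaw, PubMed Central
academic_bytes = (50 + 50 + 100) * GB
tokens_per_byte = 0.2

pile_academic_tokens = academic_bytes * tokens_per_byte
print(f"Pile academic tokens: {pile_academic_tokens / 1e9:.0f}B")  # ~40B

# Assumption: the Pile covers only 10% of the academic literature,
# so a full influx of peer-reviewed journals would be ~10x larger.
pile_coverage = 0.10
journal_influx_tokens = pile_academic_tokens / pile_coverage
print(f"Hypothetical journal influx: {journal_influx_tokens / 1e9:.0f}B")  # ~400B

# Compare against Chinchilla's ~1.4T training tokens.
chinchilla_tokens = 1.4e12
print(f"Increase over Chinchilla: {journal_influx_tokens / chinchilla_tokens:.0%}")  # ~29%
```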
Wow, what a great answer. Appreciate it!