On the data front, it seems like ChatGPT and other AIs don't have access to the mass of peer-reviewed journals yet. Obviously this isn't (relatively speaking) a huge quantity of data, but the quality would be orders of magnitude higher than what they are looking at now. Could access to these change things much at all?
That's a reasonable point, but I don't think peer-reviewed journals would make much difference.
The Pile (https://arxiv.org/pdf/2101.00027.pdf) is a large dataset used to train many language models. It includes the academic datasets ArXiv, FreeLaw, and PubMed Central, which contain roughly 50GB, 50GB, and 100GB of data respectively. Table 7 of the paper puts each byte at ~0.2 tokens, so that's about 40B tokens representing a good chunk of the academic literature on several subjects. If we had a similarly sized influx of peer-reviewed journals, would that change the data picture?
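If it helps to see the arithmetic, here's the back-of-envelope version. The subset sizes are the rounded figures quoted above (not exact numbers from the paper), and the ~0.2 tokens/byte conversion is the Table 7 figure:

```python
# Back-of-envelope token estimate for the Pile's academic subsets.
# Sizes are the rounded figures quoted above; tokens/byte is from Table 7.

GB = 10**9  # bytes per gigabyte (decimal)

subset_sizes_gb = {
    "ArXiv": 50,
    "FreeLaw": 50,
    "PubMed Central": 100,
}

TOKENS_PER_BYTE = 0.2

total_bytes = sum(subset_sizes_gb.values()) * GB
estimated_tokens = total_bytes * TOKENS_PER_BYTE

print(f"Total size: {total_bytes / GB:.0f} GB")             # 200 GB
print(f"Estimated tokens: ~{estimated_tokens / 1e9:.0f}B")  # ~40B
```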
Chinchilla, a state-of-the-art language model released by DeepMind one year ago, was trained on ~1.4T tokens. Only four years prior, BERT was a SOTA model trained on ~6B tokens. If we assume the Pile covers only 10% of the existing academic literature, the full body of peer-reviewed journals would come to roughly 400B tokens, an influx of about 360B new tokens (the Pile's 40B are already in use) that would increase available data by ~25% over Chinchilla's training set. This would meaningfully expand the dataset, but not by the orders of magnitude necessary to sustain scaling for months and years.
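Same kind of sketch for the extrapolation, where the 10% coverage figure is purely an assumption on my part rather than anything from the paper:

```python
# Rough extrapolation: how much new data would the full peer-reviewed
# literature add relative to Chinchilla, assuming (not from the paper)
# that the Pile's academic subsets cover only ~10% of it.

pile_academic_tokens = 40e9   # estimate from the snippet above
assumed_coverage = 0.10       # assumed fraction already in the Pile
chinchilla_tokens = 1.4e12    # Chinchilla's training set, ~1.4T tokens

full_literature_tokens = pile_academic_tokens / assumed_coverage  # ~400B
new_tokens = full_literature_tokens - pile_academic_tokens        # ~360B not yet used

print(f"Full literature: ~{full_literature_tokens / 1e9:.0f}B tokens")
print(f"New tokens: ~{new_tokens / 1e9:.0f}B "
      f"(~{100 * new_tokens / chinchilla_tokens:.0f}% of Chinchilla's data)")
```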
Wow, what a great answer, appreciate it!