I’ve been thinking about this specific idea:
Intuitively, I think it makes sense that data should be the limiting factor of AI growth. A human with an IQ of 150 growing up in the rainforest will be very good at identifying plants, but won’t suddenly discover quantum physics. Similarly, an AI trained only on images of trees, even with 100 times more compute than we have now, will not be able to make progress in quantum physics.
It seems to me that you’re making the point that extreme out-of-distribution domains are unreachable by generalization (at least rapidly). But consider that humans actually did go from merely identifying plants to making progress in quantum physics. How did that happen?
Humans didn’t do it all of a sudden. It was only possible in a stepwise fashion spanning generations, and required building on past knowledge (the way to climb ten steps up a ladder is simply to climb one step at a time, ten times over).
Human population increases meant that more people were working on generating new knowledge.
Humans had to (as you point out) gather new information (not in our rainforest training set) in order to learn new insights.
Humans often had to test their insights to gain practical knowledge (which you also point out with respect to theoretical vs. experimental physics).
If we assume that generating high-quality synthetic data cannot produce new knowledge outside the learned domain, then avoiding the data ceiling necessarily requires gathering information that humans have not yet gathered. As long as humans are required to gather that information, sustained exponential improvement is unlikely, since human information-gathering speed would not increase in tandem. Okay, suppose we remove the human bottleneck. In that case, an exponentially improving AI would have to find a way to gather information from the outside world at exponentially increasing speed (and test its insights/theories at that speed too). Can you think of any way this would be possible? Otherwise, I find it hard not to reach the same conclusion as you.
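The dynamic above can be sketched as a toy simulation (my own illustration, with made-up numbers, not anything from the post): capability tries to compound each step, but is capped by the stock of gathered data, which humans add to at a fixed linear rate. Growth looks exponential until the data ceiling binds, then degrades to linear.

```python
# Toy model: exponential capability growth bottlenecked by linearly
# gathered data. All quantities and rates here are illustrative assumptions.

def simulate(steps, growth=2.0, human_rate=10.0):
    capability, data = 1.0, 1.0
    history = []
    for _ in range(steps):
        data += human_rate                            # humans gather data at constant speed
        capability = min(capability * growth, data)   # data acts as a hard ceiling
        history.append(capability)
    return history

h = simulate(12)
print(h)
# Early steps double (2, 4, 8, 16, 32); once capability catches up to the
# data supply, each step only adds human_rate — growth is now linear.
```

The point of the sketch is just that the slower process sets the long-run pace: no matter how aggressive the growth factor, the trajectory converges to the data-gathering rate.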
It’s interesting how OpenAI basically concedes that it’s a fruitless effort further down in the very same post:
It’s not hard to imagine compute eventually becoming cheap and fast enough to train GPT4+ models on high-end consumer computers. How does one limit homebrewed training runs without limiting capabilities that are also used for non-training purposes?