More generally, I take issue with the idea that the number of "AI researchers" scales linearly with effective compute (gamma = 1 is put forward as your default hypothesis), and that these "AI researchers" can be assumed to have the same attributes as human researchers, like their beta value.
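To spell out what I'm objecting to (this is my reading of the paper's default assumption, not a quote from it):

```latex
% The assumed functional form, as I understand it:
N_{\text{AI researchers}} \;\propto\; C_{\text{eff}}^{\,\gamma},
\qquad \gamma = 1
```

i.e. doubling effective compute is assumed to double the effective research workforce. It's that proportionality I don't buy.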
If you double the thinking time of ChatGPT, or its training time, do you get twice-as-good results? Empirically, no. By OpenAI's own account, you need exponential increases in compute to get linear gains in accuracy.
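To make that concrete (an illustrative functional form of my own, not OpenAI's published fit): if accuracy grows logarithmically in compute,

```latex
% Illustrative only -- logarithmic returns to compute:
\text{accuracy}(C) \;\approx\; a + b \log C
\;\Longrightarrow\;
\text{accuracy}(2C) - \text{accuracy}(C) \;=\; b \log 2
```

then doubling compute buys only a fixed additive gain, and each further increment of accuracy costs a constant multiplicative factor in compute. That is the opposite of linear scaling.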
Running two AI systems in parallel is just not the same as hiring two different researchers. Each researcher brings new ideas, training, and background, while each AI is an identical clone. If you think this will change in the future, that's fine, but it's a pretty big assumption imo.
Again, thanks for this! I think this is an important issue which it might be worth addressing more directly in the paper. Two comments, and I’m interested in what you think about them.
Comment 1: I’m not sure that the analogy to the relationship between compute and accuracy here is apt. When we duplicate an automated AI researcher, we are not trying to improve our accuracy on a single task; we are working on multiple tasks in parallel.
Comment 2: I do think the analogy to cloning is apt. Consider some talented ML researcher at a top lab — call her Ava. We can ask: If we had a duplication machine such that we could make any number of copies of Ava, how would the total quality-weighted research effort contributed by making n copies compare to the total quality-weighted research effort contributed by instead hiring n additional engineers? It is correct that the copies of Ava will not come into the world with new ideas, training, or background. But I imagine this would not be such a huge limitation for them, since engineers can and do retrain and come up with new ideas. On the other hand, hiring n additional engineers is sampling without replacement from a fixed pool of talent, so we should expect that as we hire more engineers, they will become more and more inferior on average to Ava.
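Here is a toy numerical version of the comparison I have in mind. Every number and distribution below is made up purely for illustration; the point is only the qualitative shape of the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical talent pool of 10,000 engineers; each entry is a
# quality-weighted research output. The log-normal shape is an
# arbitrary illustrative choice, not a claim about real talent.
pool = np.sort(rng.lognormal(mean=0.0, sigma=0.5, size=10_000))[::-1]

# Suppose Ava happens to sit at the 100th-best spot in this pool.
ava_quality = pool[99]

for n in [10, 100, 1_000, 5_000]:
    copies_total = n * ava_quality   # n identical copies of Ava
    hires_total = pool[:n].sum()     # best n hires, sampled without replacement
    print(f"n={n:5d}  copies={copies_total:9.1f}  hires={hires_total:9.1f}")
```

With parameters like these, the top hires beat the copies while n is small (the first hires are better than Ava), but as n grows the marginal hire falls further and further below Ava, so the copies eventually pull ahead in total.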
So in sum, I suppose I agree with you that setting gamma to 1 is not strictly speaking a conceptual truth, but given the cognitive flexibility we are assuming when we talk about a human-level digital machine learning researcher, I feel confident that gamma is approximately 1.
No worries, I’m glad you find these critiques helpful!
I think the identical-clone thing is an interesting thought experiment, and one that perhaps reveals some differences in worldview. I think duplicating Ava a couple of times would lead to a roughly linear increase in output, sure, but if you kept duplicating you'd run into diminishing returns. A large software company whose engineers were entirely replaced with Avas would be a literal groupthink factory: all of Ava's blind spots and biases would be completely entrenched, making the whole enterprise brittle.
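One crude way to formalize that intuition (entirely a toy model of mine): if every copy of Ava draws from the same fixed space of ideas she can access, the number of *distinct* ideas n copies surface saturates rather than growing linearly.

```latex
% Toy model: each of n copies draws k ideas uniformly at random
% from Ava's fixed accessible idea space of size D.
\mathbb{E}[\text{distinct ideas}]
\;=\; D\!\left(1 - \left(1 - \tfrac{1}{D}\right)^{kn}\right)
\;\xrightarrow[\;n \to \infty\;]{}\; D
```

No matter how many copies you add, you never get past D. Researchers with genuinely different backgrounds effectively enlarge D itself, which is what the duplication machine can't do.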
I think the push and pull of different personalities is essential to creative production in science. If you look at the history of scientific development, progress is rarely the work of a single genius; more typically it is driven by collaborations and fierce disagreements.
With regards to comment 1: yeah, "accuracy" is an imperfect proxy, but I think it makes more sense than "number of tasks done" as a measure of algorithmic progress. This seems like an area where quality matters more than quantity. If I'm using ChatGPT to generate ideas for a research project, will running five different instances make the final ideas five times as good?
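Order statistics suggest not. Even under a generous model where each instance's idea quality is an independent Gaussian draw (generous, because identical clones' samples won't be fully independent), the expected best of n grows only like the square root of log n:

```latex
% Best-of-n under an (optimistic) independent-Gaussian model of idea quality:
\mathbb{E}\big[\max(X_1,\dots,X_n)\big] \;\approx\; \mu + \sigma\sqrt{2\ln n},
\qquad X_i \sim \mathcal{N}(\mu, \sigma^2)
```

Going from one instance to five moves the best draw by a fraction of a standard deviation, nowhere near a five-fold improvement.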
I feel like there's a hidden assumption here that AI will at some point switch from acting like LLMs act in reality to acting like a "little guy in the computer". I don't think this is the case; I think AI may end up having different advantages and disadvantages compared to human researchers.