First: yes, current ML is very sample-inefficient, but that's not a reason to base the bound on current sample efficiency. Second, long-term feedback from sequences of actions is exactly what AlphaGo's policy network addressed, and while I agree that this requires simulating an environment, it says little about how rich that environment needs to be. And because running a model is much cheaper than training it, self-play is tractable as a way to generate complex environments.
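For concreteness, here's a minimal sketch of the cost profile I have in mind. Everything in it is illustrative (a toy Nim game, made-up function names like `play_one_game`), not AlphaGo's actual setup; the point is just that each self-play game is a handful of cheap policy evaluations, while the gradient-style update is the only expensive step:

```python
import random

# Toy self-play: Nim with 5 stones, take 1 or 2 per turn,
# whoever takes the last stone wins. Two copies of the same
# policy play each other; finished games become training data.

STONES = 5

def legal_moves(stones):
    return [m for m in (1, 2) if m <= stones]

def policy(stones, weights, epsilon=0.1):
    # Cheap stand-in for a network forward pass (inference).
    if random.random() < epsilon:
        return random.choice(legal_moves(stones))
    return max(legal_moves(stones),
               key=lambda m: weights.get((stones, m), 0.0))

def play_one_game(weights):
    # The opponent is just another copy of the policy, so the
    # "environment" gets generated for free by self-play.
    stones, player, history = STONES, 0, []
    while stones > 0:
        move = policy(stones, weights)
        history.append((player, stones, move))
        stones -= move
        player = 1 - player
    winner = 1 - player  # the player who just took the last stone
    return winner, history

def train(num_games=5000, lr=0.1):
    weights = {}
    for _ in range(num_games):
        winner, history = play_one_game(weights)
        for player, stones, move in history:
            # Reinforce the winner's moves, penalize the loser's.
            sign = 1.0 if player == winner else -1.0
            key = (stones, move)
            weights[key] = weights.get(key, 0.0) + lr * sign
    return weights

if __name__ == "__main__":
    # With 5 stones the first player can force a win by taking 2
    # (leaving a multiple of 3); the learned weights reflect that.
    w = train()
    print({k: round(v, 2) for k, v in sorted(w.items())})
```

In this toy, generating another thousand games costs only forward passes, which is why self-play scales as a source of increasingly hard opponents and positions even when training updates are expensive.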
But overall, I agree that it’s an assertion which would need to be defended.