once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run several hundred million copies for about a year each.
How does computing power work here? Is it:
(1) We use a supercomputer to train the AI, then the supercomputer is just sitting there, so we can use it to run models; or
(2) We’re renting a server to do the training, and then have to rent more servers to run the models.
In (2), we might use up our whole budget on the training, and then not be able to afford to run any models.
Sorry for chiming in so late! The basic idea here is that if you have 2x the resources it would take to train a transformative model, then you have enough to run a huge number of them.
It’s true that the first transformative model might eat all the resources its developer has at the time. But it seems likely that (a) given that they’ve raised $X to train it as a reasonably speculative project, once it turns out to be transformative there will probably be at least a further $X available to pay for running copies; (b) not too long after, as compute continues to get more efficient, someone will have 2x the resources needed to train the model.
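To make the “train once, run many copies” arithmetic concrete, here is a minimal back-of-the-envelope sketch. The training and inference figures below are placeholder assumptions in the spirit of the biological-anchors estimates, not numbers taken from the post, so the exact count should not be taken literally; the qualitative point is that one training run’s worth of compute buys an enormous number of copy-years.

```python
# Back-of-the-envelope sketch of the "train once, run many copies" claim.
# The numbers below are placeholder assumptions (roughly bio-anchors-flavored),
# not figures taken from the post.

SECONDS_PER_YEAR = 365 * 24 * 3600            # ~3.15e7 seconds

training_flop = 1e30                          # assumed total FLOP for the training run
inference_flop_per_second = 1e15              # assumed FLOP/s to run one copy in real time

flop_per_copy_year = inference_flop_per_second * SECONDS_PER_YEAR
copy_years = training_flop / flop_per_copy_year

print(f"One copy-year costs ~{flop_per_copy_year:.1e} FLOP")
print(f"Re-spending the training compute buys ~{copy_years:.1e} copy-years")
```

With these particular placeholders the answer comes out to tens of millions of copy-years; shifting either assumed figure by an order of magnitude moves it into the “several hundred million copies” range quoted at the top.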
Basically, is the computing power for training a fixed cost or a variable cost? If it’s a fixed cost, then there’s no further cost to using the same computing power to run models.
I haven’t read the OP (I haven’t read a full forum post in weeks and I don’t like reading; it’s better to, like, close your eyes and try coming up with the entire thing from scratch and see if it matches, using high-information tags to compare with, generated with a meta model), but I think this is a reference to the usual training/inference cost difference.
For example, you can run GPT-3 Davinci in a few seconds at trivial cost. But training it cost millions of dollars and took a long time.
There are further considerations. For example, finding the architecture (stacking more things in Torch, fiddling with parameters, figuring out how to implement the Key Insight, etc.) for the first breakthrough model is probably also expensive and hard.
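To put rough numbers on the GPT-3 example, here is a sketch using the common rules of thumb that training costs about 6·N·D FLOP and a forward pass costs about 2·N FLOP per token; N and D are the commonly cited GPT-3 figures, and everything should be read as order-of-magnitude only.

```python
# Rough training-vs-inference comparison at GPT-3 scale, using the common rules
# of thumb: training ~ 6 * N * D FLOP, one forward pass ~ 2 * N FLOP per token.
# N and D are the commonly cited GPT-3 figures; treat this as order-of-magnitude only.

N = 175e9      # parameters
D = 300e9      # training tokens

training_flop = 6 * N * D                 # ~3e23 FLOP total, a one-off cost
inference_flop_per_token = 2 * N          # ~3.5e11 FLOP, paid per generated token

tokens_per_training_budget = training_flop / inference_flop_per_token

print(f"Training: ~{training_flop:.1e} FLOP (one-off)")
print(f"Inference: ~{inference_flop_per_token:.1e} FLOP per token")
print(f"The training budget would cover ~{tokens_per_training_budget:.1e} tokens of inference")
```

The ratio works out to roughly 3·D, i.e. the training budget would cover about three passes over the entire training corpus as pure inference, which is consistent with a training step being roughly a forward plus a backward pass per token.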
Let CT be the computing power used to train the model. Is the idea that “if you could afford CT to train the model, then you can also afford CT for running models”?
Because that doesn’t seem obvious. What if you used 99% of your budget on training? Then you’d only be able to afford about 0.01×CT for running models.
Or is this just an example to show that training costs >> running costs?
Yes, that’s how I understood it as well. If you spend the same amount on inference as you did on training, then you get a hell of a lot of inference.
I would expect he’d also argue that, because companies are willing to spend tons of money on training, we should also expect them to be willing to spend lots on inference.
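As a small sketch of the budget accounting in this exchange: if training consumes a fraction f of a fixed total compute budget, the leftover buys ((1 - f) / f)·CT worth of compute for running copies. The loop below just evaluates that expression for a few values of f; the 0.99 case matches the “only about 0.01×CT” worry above, and the 0.5 case is the “2x the resources” situation, where a full training run’s worth of compute is left over for inference.

```python
# Budget accounting for the question above. C_T is the compute needed to train
# the model; f is the fraction of a fixed total budget spent on training.
# The remaining budget buys ((1 - f) / f) * C_T worth of compute for running copies.

def running_compute_multiple(f: float) -> float:
    """Compute left for running models, expressed as a multiple of C_T."""
    return (1 - f) / f

for f in (0.99, 0.9, 0.5):
    print(f"training fraction {f:.2f} -> {running_compute_multiple(f):.3f} x C_T for running models")
```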
Do we know the expected cost for training an AGI? Is that within a single company’s budget?
Nearly impossible to answer. This report by OpenPhil gives it a hell of an effort, but could still be off by orders of magnitude. Most fundamentally, the amount of compute necessary for AGI might not be related to the amount of compute used by the human brain, because we don’t know how our algorithmic efficiency compares to the brain’s.
https://www.cold-takes.com/forecasting-transformative-ai-the-biological-anchors-method-in-a-nutshell/
Yes, the last sentence is exactly correct.
So, the terms of art here are “training” versus “inference”. I don’t have a reference or guide (the relative size of the two isn’t something most people think about, versus the absolute size of each individually), but if you google them and scroll through some papers or posts I think you will see some clear examples.
Just LARPing here. I don’t really know anything about AI or machine learning.
I guess in some deeper sense you are right and (my simulated version of) what Holden has written is imprecise.
We don’t really see many “continuously” updating models where training continues live during use. So the mundane pattern we see today, where inference (trivially running the trained model, often on silicon built specifically for inference) is much cheaper than training, may for some reason not apply to the pattern that an out-of-control AI uses.
It’s not impossible that, if the system needs to be self-improving, it has to continually provision a large fraction of its training cost, or something like that.
It’s not really clear what the “shape” of this “relative cost curve” would be, or whether it would only hold for a short period of time, and it doesn’t make the scenario any less dangerous.
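As a toy illustration of the worry in these last comments, here is a sketch in which each running copy must also pay a fraction r of the full training cost per year of operation (say, for continual self-improvement). The FLOP figures are the same illustrative placeholders used earlier and r is purely hypothetical; the point is only that even a small continual-retraining overhead can shrink the number of affordable copies by orders of magnitude.

```python
# Toy model of the "continual retraining" worry: each copy must also pay a
# fraction r of the full training cost per year of operation. The FLOP figures
# are illustrative placeholders; r is purely hypothetical.

SECONDS_PER_YEAR = 365 * 24 * 3600

training_flop = 1e30                                 # assumed total FLOP for the training run
inference_flop_per_year = 1e15 * SECONDS_PER_YEAR    # assumed FLOP/s per copy, over a year

for r in (0.0, 1e-6, 1e-3):
    cost_per_copy_year = inference_flop_per_year + r * training_flop
    copies = training_flop / cost_per_copy_year
    print(f"retraining fraction r={r:g}: ~{copies:.1e} copy-years per training budget")
```

With these numbers, r = 0 gives tens of millions of copy-years while r = 0.001 gives only about a thousand, so the “shape” of this overhead matters a lot to how many copies could actually run.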