I haven’t read the OP (I haven’t read a full forum post in weeks and I don’t like reading; it’s better to, like, close your eyes and try coming up with the entire thing from scratch and see if it matches, using high-information tags to compare with, generated with a meta model), but I think this is a reference to the usual training/inference cost difference.
For example, you can run GPT-3 Davinci in a few seconds at trivial cost, but training it cost millions of dollars and took a long time.
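To make that concrete, here is a minimal back-of-envelope sketch, assuming the commonly used approximations of roughly 6·N·D FLOPs to train a dense transformer and roughly 2·N FLOPs per generated token, with GPT-3’s widely cited figures (about 175B parameters, about 300B training tokens). The numbers are illustrative assumptions, not anything from the OP:

```python
# Back-of-envelope: GPT-3-scale training compute vs. one inference query.
# Uses the common approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs/token.
# N and D are the widely cited GPT-3 figures; treat them as rough assumptions.

N = 175e9                                # parameters
D = 300e9                                # training tokens
training_flops = 6 * N * D               # ~3.2e23 FLOPs

query_tokens = 1000                      # a generous prompt + response
inference_flops = 2 * N * query_tokens   # ~3.5e14 FLOPs

print(f"training:  {training_flops:.1e} FLOPs")
print(f"one query: {inference_flops:.1e} FLOPs")
print(f"ratio:     {training_flops / inference_flops:.1e}x")  # ~1e9 queries' worth
```

So, under these assumptions, one training run costs on the order of a billion times more compute than answering a single query.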
There are further considerations. For example, finding the architecture for the first breakthrough model (stacking more things in Torch, fiddling with parameters, figuring out how to implement the Key Insight, etc.) is probably even more expensive and difficult.
Let C_T be the compute used to train the model. Is the idea that “if you could afford C_T to train the model, then you can also afford C_T for running models”?
Because that doesn’t seem obvious. What if you used 99% of your budget on training? Then you’d only be able to afford roughly 0.01×C_T for running models.
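Spelling out the arithmetic (a small sketch, where f is just the fraction of the total compute budget spent on training):

$$C_{\text{inference}} = \frac{1-f}{f}\,C_T, \qquad f = 0.99 \;\Rightarrow\; C_{\text{inference}} = \frac{0.01}{0.99}\,C_T \approx 0.01\,C_T.$$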
Or is this just an example to show that training costs >> running costs?
Yes, that’s how I understood it as well. If you spend the same amount on inference as you did on training, then you get a hell of a lot of inference.
I would expect he’d also argue that, because companies are willing to spend tons of money on training, they’ll be willing to spend a lot on inference too.
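To put a rough number on “a hell of a lot”: under the same 6·N·D training and 2·N-per-token inference approximations used above, matching the training compute with inference compute buys about 3·D generated tokens, i.e. roughly three times as many tokens as the model was trained on, regardless of model size. A minimal sketch, with GPT-3-ish numbers as assumptions:

```python
# If you spend your training compute budget (~6*N*D FLOPs) purely on inference
# (~2*N FLOPs per generated token), the parameter count cancels out:
#   tokens_generated = 6*N*D / (2*N) = 3*D
N = 175e9      # parameters (cancels out below; GPT-3-ish, as an assumption)
D = 300e9      # training tokens (GPT-3-ish, as an assumption)

tokens_generated = (6 * N * D) / (2 * N)
print(f"{tokens_generated:.0e} tokens ~= {tokens_generated / D:.0f}x the training set")
# -> 9e+11 tokens ~= 3x the training set
```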
Do we know the expected cost for training an AGI? Is that within a single company’s budget?
Nearly impossible to answer. This report by OpenPhil gives it a hell of an effort, but it could still be off by orders of magnitude. Most fundamentally, the amount of compute necessary for AGI might not be closely related to the amount of compute used by the human brain, because we don’t know how the efficiency of our algorithms compares to the brain’s.
https://www.cold-takes.com/forecasting-transformative-ai-the-biological-anchors-method-in-a-nutshell/
Yes, the last sentence is exactly correct.
So the terms of art here are “training” versus “inference”. I don’t have a reference or guide (the relative size of the two isn’t something most people think about, as opposed to the absolute size of each individually), but if you google those terms and scroll through some papers or posts I think you’ll see some clear examples.
Just LARPing here. I don’t really know anything about AI or machine learning.
I guess in some deeper sense you are right and (my simulated version of) what Holden has written is imprecise.
We don’t really see many “continuously” updating models, where training continues live during use. So the mundane pattern we see today, in which inference (trivially running the trained model, often on silicon built specifically for inference) is much cheaper than training, may not apply, for some reason, to the pattern that an out-of-control AI uses.
It’s not impossible that, if the system needs to be self-improving, it has to continually provision a large fraction of its training cost, or something like that.
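For a rough sense of what that could mean (a sketch under the standard approximations of ~2·N FLOPs to generate a token and ~6·N FLOPs to train on one, not anything from the post): a system that also trained on every token it produced would pay roughly 4× the compute per token of an inference-only system.

```python
# Sketch: per-token compute for an "always learning" system, using the standard
# approximations of ~2*N FLOPs to generate a token and ~6*N FLOPs to train on one.
# The ratio is independent of N; the fraction trained on is a made-up knob.
N = 175e9                      # parameters (arbitrary; cancels in the ratio)
inference_per_token = 2 * N
training_per_token = 6 * N

for trained_fraction in (0.0, 0.1, 1.0):   # share of generated tokens also trained on
    total = inference_per_token + trained_fraction * training_per_token
    print(f"train on {trained_fraction:>4.0%} of output: "
          f"{total / inference_per_token:.1f}x inference-only cost")
# -> 1.0x, 1.3x, 4.0x
```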
It’s not really clear what the “shape” of this “relative cost curve” would be, or whether this would only last a short period of time, and it doesn’t make the scenario any less dangerous.