Here are some forecasts for near-term progress / impacts of AI on research. They are the results of some small-ish number of hours of reading + thinking, and shouldn’t be taken at all seriously. I’m sharing in case it’s interesting for people and especially to get feedback on my bottom line probabilities and thought processes. I’m pretty sure there are some things I’m very wrong about in the below and I’d love for those to be corrected.
Deepmind will announce excellent performance from Alphafold2 (AF2) or some successor / relative for multi-domain proteins by end of 2023; or some other group will announce this using some AI scheme: 80% probability
Deepmind will announce excellent performance from AF2 or some successor / relative for protein complexes by end of 2023; or some other group will announce this using some AI scheme: 70% probability
Widespread adoption of a system like OpenAI Codex for data analysis will happen by end of 2023: 20% probability
I realise that “excellent performance” etc. is vague; I choose to live with that rather than putting in the time to make everything precise (or not doing the exercise at all).
If you don’t know what multi-domain proteins and protein complexes are, I found this blog post by Mohammed Al Quraishi very useful (maybe try Ctrl-F for those terms), although you may need some relevant background knowledge to follow it. I don’t have a great sense of how big a deal these would be for various areas of biological science, but my impression is that each is roughly the same order of magnitude of usefulness as excellent performance on single-domain proteins (i.e. what AF2 has already achieved).
As for why:
80% chance that excellent AI performance on multi-domain proteins is announced by end of 2023
Top reasons for event happening
Deepmind apparently wants to tackle protein complexes (“Based on this and DeepMind’s repeated assertions…”) and multi-domain proteins seem like a stepping stone for that
Mohammed Al Quraishi already thought AF2 dealt with multi-domain proteins “just fine” in Dec 2020
It’s an extremely similar problem to the one they already cracked (single domains) and it seems tractable
Top reasons against
Maybe they won’t announce it, because it’s not newsworthy enough; or they’ll bundle it with some bigger announcement with lots of cool results (resulting in delayed announcement)
Other reasons against
In particular, the results from the next CASP competition will presumably be announced in December 2022; if they haven’t cracked it by then, maybe we won’t hear about it by end of 2023
They’d need to get there by April 2022 (I think that is the submission deadline for CASP)
Maybe it will turn out to be way less tractable than expected
Maybe Deepmind will have other, even more pressing priorities, or some key people will unexpectedly leave, or they’ll lose funding, or something else unexpected happens
Key uncertainties
Are rival protein folding schemes targeting this?
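One way to sanity-check the 80% headline figure is to decompose it into “solved in time for CASP15 and announced there” vs. “solved later and announced some other way”. The sub-probabilities below are illustrative assumptions of mine, not figures stated anywhere in this post:

```python
# Toy decomposition of P(excellent multi-domain performance announced by end of 2023).
# Every number below is an illustrative assumption, not a claim from the text.
p_solved_by_casp_deadline = 0.70  # solved by ~April 2022 (assumed CASP15 deadline)
p_casp_results_out = 0.95         # CASP15 results announced as expected, ~Dec 2022
p_solved_later = 0.50             # solved mid-2022 to mid-2023, given CASP was missed
p_other_announcement = 0.60       # then announced outside CASP before end of 2023

p_via_casp = p_solved_by_casp_deadline * p_casp_results_out
p_otherwise = (1 - p_solved_by_casp_deadline) * p_solved_later * p_other_announcement
p_total = p_via_casp + p_otherwise
print(p_total)  # lands in the same ballpark as the 80% headline figure
```

The point is less the particular numbers than the structure: the forecast is dominated by whether the problem is cracked before the CASP submission deadline.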
70% chance that excellent AI performance on protein complexes is announced by end of 2023
Top reasons for event happening
Deepmind apparently wants to tackle protein complexes (“Based on this and DeepMind’s repeated assertions…”)
Mohammed Al Quraishi already thought AF2 dealt with multi-domain proteins “just fine” in Dec 2020, and these are a stepping stone to protein complexes
There is quite a lot of data (though 10x less than for single proteins)
Unsure whether transfer learning (I may be using the wrong term) is relevant here?
Top reasons against
Maybe even if it’s done by say mid 2023 it won’t be announced until after 2023 because of Deepmind’s media strategy
In particular, targeting a CASP would seem to require the high performance to be achieved by mid 2022; maybe this is the most likely scenario in worlds where Deepmind announces this before end of 2023
(although if Deepmind doesn’t get there by CASP15, it seems like another group might announce something in say 2023)
Protein complexes are (maybe?) qualitatively different to single proteins
Other reasons against
Maybe the lack of data will be decisive
Maybe Deepmind’s priorities will change, etc., as noted above in the multi-domain case
Key uncertainties
Are rival schemes targeting this?
20% chance of widespread adoption of a system like OpenAI Codex for data analysis by end of 2023
(NB this is just about data analysis / “data science” rather than about usage of Codex in general)
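To make concrete what “Codex for data analysis” would mean: the user writes a natural-language request and the model emits analysis code. The snippet below is hand-written by me to illustrate the kind of output such a tool aims to produce; it is not actual Codex output, and the data is made up:

```python
# Prompt a user might give a Codex-style assistant:
#   "Group these sales records by region and report mean revenue, highest first."
# Hand-written illustration of the code such a tool would be expected to generate.
from collections import defaultdict
from statistics import mean

sales = [("EU", 100), ("US", 250), ("EU", 150), ("APAC", 80)]

by_region = defaultdict(list)
for region, revenue in sales:
    by_region[region].append(revenue)

means = sorted(((mean(v), k) for k, v in by_region.items()), reverse=True)
for m, region in means:
    print(region, m)
```

The question the 20% forecast turns on is whether generating snippets like this is reliable and convenient enough for analysts to reach for it routinely.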
My “best guess” scenario
OpenAI releases an API for data science that is cheap but not free. In its current iteration, the software is “handy” but not more than that. A later iteration, released in 2023, is significantly more powerful and useful. But by the end of 2023 it is still not yet “widely used”.
Some reasons against event happening
Maybe Codex is currently not that useful in practice for data analysis
I think OpenAI won’t release it for free so it won’t become part of the “standard toolkit” in the same way that e.g. RStudio has
Things like RStudio take a long time to diffuse / become adopted
E.g. my guess is ~5 years for IPython notebook or RStudio to reach 25% uptake among data scientists, or something like that
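That ~5-year guess can be turned into a rough uptake estimate for the Codex case. Assuming (my assumptions, not claims made above) a mid-2021 release and roughly linear early adoption at an RStudio-like rate:

```python
# Back-of-envelope adoption estimate. Assumptions (mine): roughly linear early
# adoption, a mid-2021 release, and the ~5-years-to-25%-uptake rate guessed above.
years_to_25pct = 5.0
rate_per_year = 0.25 / years_to_25pct   # ~5 percentage points of uptake per year
years_available = 2.5                   # mid-2021 release through end of 2023
uptake = rate_per_year * years_available
print(f"estimated uptake by end of 2023: {uptake:.1%}")
```

On this crude model a Codex-based tool would still be far short of “widespread” by end of 2023, which is one way to see why the headline probability here is low.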
Key uncertainties
How much are OpenAI going to push this on people?
How much are they pushing the data science aspect particularly?
Will this be ~free to use or will it be licensed?
How quickly will it improve? How often does OpenAI release improved versions of things?
How fast did IPython notebook/RStudio get adopted?
How much has the GPT-3 API been used so far?
Some useful links
From the OpenAI website: “During the initial period, OpenAI Codex will be offered for free”
This guy at VentureBeat thinks that Microsoft will make the profits from Codex (but that doesn’t seem to rule out its being widely used to assist with data analysis; it just implies that the data analysis software will be owned by Microsoft, I guess?)
The Verge says “although Codex is initially being released as free API, OpenAI will start charging for access at some point in the future.”
The impression I get from Appendix H of OpenAI’s paper on Codex is that they expect at least fairly wide adoption of Codex or something similar