Computational Approaches to Pathogen Detection
While this post is my perspective and not an official post of my employer's, it also draws on a lot of collaborative work with others at the Nucleic Acid Observatory (NAO).
One of the future scenarios I'm most worried about is someone creating a "stealth" pandemic. Imagine a future HIV that first infects a large number of people with minimal side effects and only shows its nasty side after it has spread very widely. This is not something we're prepared for today: current detection approaches (symptom reporting, a doctor noticing an unusual pattern) require visible effects.
Over the last year, with my colleagues at the NAO, I've been exploring one promising method of identifying this sort of pandemic. The overall idea is:
Collect some sort of biological material from a lot of people on an ongoing basis, for example by sampling sewage.
Use metagenomic sequencing to learn what nucleic acids are in these samples.
Run novel-pathogen detection algorithms on the sequencing data.
When you find something sufficiently concerning, follow up with tests for the specific thing you've found.
While there are important open questions in all four of these, I've been most focused on the third: once you have metagenomic sequencing data, what do you do?
I see four main approaches. You can look for sequences that are:
Dangerous: There are some genetic sequences that code for dangerous things that we should not normally see in our samples. If you see a series of base pairs unique to smallpox, that's very concerning! The main downside of this approach, if you want to extend it beyond smallpox and similar known threats, is that you need to make a list of non-obvious dangerous things, which is itself a dangerous thing to do: what if your list is stolen and it points people to sequences they wouldn't have thought to try using?
This is similar to another problem: how do you check if people are synthesizing dangerous sequences without risking a list of all the things that shouldn't be synthesized? SecureDNA has been working on this problem, with an encrypted database with a distributed key system that allows flagging sequences without it being practical to get a list of all flagged sequences (paper).
There are some blockers to using SecureDNA for this today, since it was designed for slightly different constraints, but I think they are all surmountable and I'm hoping to implement a SecureDNA-based metagenomic sequencing screening system at some point in the next year.
An alternative and somewhat longer-term approach would be to use tools that can estimate the function of novel sequences, extending detection to sequences that aren't closely derived from existing ones. I'm less enthusiastic about this: not only could the work end up increasing risk by improving humanity's ability to judge how dangerous a novel sequence is, but it's also not clear to me that this approach would catch things the other methods wouldn't.
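To make the matching step concrete, here is a minimal sketch of watchlist screening, assuming exact k-mer matching against a plaintext list of sequences of concern (the function names and k-mer length are my own illustrative choices). Note that keeping the watchlist in plaintext is exactly the hazard described above; a system like SecureDNA matches against an encrypted database instead.

```python
# Minimal sketch of watchlist screening: flag any read sharing a k-mer
# with a (hypothetical, plaintext) set of sequences of concern.

K = 31  # k-mer length; 31 is a common choice in metagenomics


def kmers(seq, k=K):
    """Yield every k-length substring of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(watchlist_seqs):
    """Precompute the set of all k-mers appearing in any watchlist sequence."""
    index = set()
    for seq in watchlist_seqs:
        index.update(kmers(seq))
    return index


def flag_read(read, index):
    """Return True if the read shares at least one k-mer with the watchlist."""
    return any(km in index for km in kmers(read))
```

In a real pipeline you would also match the reverse complement and tolerate sequencing errors, but the core operation is this set lookup.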
Modified: In engineering a new virus for a stealth pandemic, the easiest route would likely be to begin with an existing virus. If we see a sequencing read where part matches a known viral genome and part does not (a "chimera"), one potential explanation is that the read comes from a genetically engineered virus.
But this is not the only reason this approach could flag a read. For example, it could come from:
Lack of knowledge. Perhaps a virus has a lot of variation, much more than is reflected in the databases you are using to define "normal". It will look like you have found a novel virus when it's just an incomplete database. And, of course, the database will always be incomplete: viruses are always evolving. Still, solving this seems practical: handling these initial false positives requires expanding our knowledge of the variety of existing viruses, but that is something many virologists are deeply interested in.
Sequencing: perhaps some of the biological processing you do prior to (or during) sequencing can attach unrelated fragments. When you see a chimera, how do you know whether it existed in the sample you originally collected or was created accidentally in the lab? In practice, you can (a) compare the fraction of chimeras across different sequencing approaches and pick ones where this is rare, and (b) pay more attention to cases where you've seen the same chimera multiple times.
Biological chimerism: bacteria will occasionally incorporate viral sequences. This method would flag that as genetic engineering even though it is a natural and unconcerning process. As long as this is rare enough, however, we can deal with it by surfacing such reads to a biologist who figures out how concerned to be and what next steps make sense.
This is the main approach I've been working on lately, trying to get the false positive rate down.
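A toy version of chimera flagging can illustrate the idea: classify each k-mer in a read as "known" (present in a reference k-mer set) or "unknown", and flag reads that are part known and part unknown with a single breakpoint. This is my simplification, not the NAO pipeline; real tools work from alignments (soft-clipped and split reads) rather than exact k-mers.

```python
# Toy chimera flagging: a read is suspicious when its known k-mers form
# one contiguous run covering only part of the read.

K = 21  # k-mer length for this toy example


def kmer_profile(read, reference_kmers, k=K):
    """Boolean list: does each k-mer of the read appear in the reference?"""
    return [read[i:i + k] in reference_kmers
            for i in range(len(read) - k + 1)]


def looks_chimeric(read, reference_kmers, k=K, min_run=5):
    """Flag reads that are part known, part unknown, with each part at
    least min_run k-mers long and the known part contiguous."""
    profile = kmer_profile(read, reference_kmers, k)
    known = sum(profile)
    unknown = len(profile) - known
    if known < min_run or unknown < min_run:
        return False  # nearly all known, or nearly all novel: not chimeric
    # require a single block of known k-mers (one breakpoint)
    first = profile.index(True)
    last = len(profile) - 1 - profile[::-1].index(True)
    return all(profile[first:last + 1])
```

The three false-positive sources above show up here directly: an incomplete reference shrinks the "known" run artificially, and lab-created or natural chimeras pass this test just as engineered ones do.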
New: If we understood what "normal" looked like well enough, then we could flag anything new for investigation. This is a serious research project: if you take data from a sewage sample and run it through basic tooling, it's common to have 50% of reads unclassified. Making progress here will require, among other things, much better tooling (and maybe algorithms) for metagenomic assembly: I'm not aware of anything that could efficiently integrate trillions of bases a week into an assembly graph.
Ryan Teo, a first-year graduate student with Nicole Wheeler at the University of Birmingham, has started his thesis in this area, which I'm really excited to see. Lenni Justen, another first-year graduate student, working with Kevin Esvelt, is also exploring this area as part of his work with the NAO. I'd be excited to see more work, however, and if you're working on this, or interested in working on it but blocked by not having access to enough metagenomic sequencing data, please get in touch!
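In its simplest form, "flag anything new" might look like the sketch below, assuming a k-mer index built from all previous samples (names and structure are mine, for illustration). It also makes the scale of the problem visible: with today's databases roughly half of sewage reads would come out "novel", which is why this only works once our picture of normal is far more complete.

```python
# Toy novelty flagging: a read is "novel" if it shares no k-mer with
# anything seen in previous samples.

K = 31  # k-mer length


def is_novel(read, known_kmers, k=K):
    """True if none of the read's k-mers have been seen before."""
    return not any(read[i:i + k] in known_kmers
                   for i in range(len(read) - k + 1))


def novel_fraction(reads, known_kmers, k=K):
    """Fraction of reads in a sample containing no previously seen k-mer."""
    flags = [is_novel(r, known_kmers, k) for r in reads]
    return sum(flags) / len(flags)
```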
Growing: It may turn out that our samples are deeply complex: as you sequence more, the rate of seeing new things may fall off very slowly. If it falls off slowly enough, you will keep seeing "new" things that are just so rare you haven't happened to see them before. I am quite unsure how likely this is, and I expect it varies by sample type (sewage is likely much more complex than, say, blood), but it seems possible. An approach that's robust to this is, instead of flagging something just for being new, to flag it based on its growth pattern: first you've never seen it, then you see it once, then you start seeing it more often, then you start seeing it many times per sample. In theory a new pandemic should begin with approximately exponential spread, since with few people already infected the number of new infections should be proportional to the number of infectious people.
At the NAO we've been calling this "exponential growth detection" (EGD). We worked on this some in 2022, but have put it on hold until we have a deep enough timeseries dataset to work with.
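A rough sketch of the EGD idea: fit a log-linear model to a k-mer's per-sample counts over time and flag it when the estimated growth rate is large. This is my simplification (threshold and function names are illustrative); a real system would have to model sequencing depth, zero counts, and multiple testing across billions of k-mers.

```python
# Rough sketch of exponential growth detection: under exponential spread,
# log(count) grows linearly in time, so the least-squares slope of
# log(count) vs. sample index estimates the growth rate.
import math


def growth_rate(counts):
    """Least-squares slope of log(count) against sample index.
    counts: per-timepoint observations of one k-mer (all > 0 here)."""
    n = len(counts)
    xs = list(range(n))
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def flag_growing(counts, threshold=0.5):
    """Flag series whose fitted growth rate exceeds the threshold."""
    return growth_rate(counts) > threshold
```

A series like 1, 2, 4, 8, 16 fits a rate of ln 2 per sample and is flagged; a flat series fits a rate near zero and is not.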
These approaches can also be combined: if a sequence originally comes to your attention because it's chimeric but you're not sure how seriously to take it, you could look at the growth pattern of its components. Or, while you can detect growing things with a genome-free approach simply by looking for increasing k-mers, the kind of "thoroughly understand the metagenome" work that I described above as an approach for identifying new things can also be used to make a much more sensitive tool that detects growing things.
In terms of prioritization, I'm enthusiastic about work on all of these, and would like to see them progress in parallel. The approaches of detecting dangerous and modified sequences require less scientific progress and should work on amounts of data that are achievable with philanthropic funding. De novo protein design is getting more capable and more accessible, however, which allows creation of pathogens those two methods don't catch. We will need approaches that don't depend on matching known things, which is where detecting new and/or growing sequences comes in. Those two methods will require a lot more data, enough that unless sequencing goes through another round of the kind of massive cost improvement we saw in 2008-2011, we're talking about large-scale government-funded projects. Advances in detection methods make it more likely that we'll be able to make the case for these larger projects, and reduce the risk that detection ability might lag infrastructure creation.
Thanks for sharing, Jeff! Are you aware of any analyses quantifying the risk of "stealth" pandemics in terms of expected deaths or probability of a certain death toll in a given period?
I'm not, sorry!
No worries! The paper Existential Risk and Cost-Effective Biosecurity is the quantification effort I am aware of. I like it, but was looking for more because it still involves some guesses which arguably warrant further investigation: