Genetic Sequencing of Wastewater: Prevalence to Relative Abundance
Back in September I wrote:
“In thinking about how you might identify future pandemics by sequencing wastewater, you might have a goal of raising an alert before some fraction of people were currently infected. What you’re actually able to observe, however, are sequencing reads, several steps removed from infection rates. Can we use covid data to estimate how the fraction of people currently infected with some pathogen might translate into the fraction of wastewater sequencing reads that match the pathogen?”
In that post I looked at a single pathogen (SARS-CoV-2) in a single metagenomic sequencing dataset (Rothman et al. 2021) and got a very rough point estimate (2.3e-8 relative abundance at 0.1% prevalence). What fraction of sequencing reads might come from a novel pathogen at some level of prevalence continues to be a key question, however, and this quarter I’m working with several other people at the NAO to get a better understanding here.
Specifically, we’d like to understand how relative abundance (fraction of sequencing reads matching an organism) varies with prevalence (what fraction of people are currently infected) and organism (ex: since we’re sampling wastewater you’d expect disproportionately more gastrointestinal than blood pathogens).
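To make the two quantities concrete, here's a minimal Python sketch. The function names and the counts are made up for illustration (I picked them to land on the rough point estimate above); they aren't taken from our pipeline or from the Rothman data.

```python
def relative_abundance(matching_reads: int, total_reads: int) -> float:
    """Fraction of a sample's sequencing reads assigned to one organism."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= matching_reads <= total_reads:
        raise ValueError("matching_reads must be between 0 and total_reads")
    return matching_reads / total_reads


def prevalence(currently_infected: int, population: int) -> float:
    """Fraction of the contributing population that is currently infected."""
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 <= currently_infected <= population:
        raise ValueError("currently_infected must be between 0 and population")
    return currently_infected / population


# Hypothetical numbers, for illustration only.
sample_ra = relative_abundance(matching_reads=23, total_reads=1_000_000_000)
sample_prevalence = prevalence(currently_infected=1_000, population=1_000_000)
print(f"relative abundance {sample_ra:.1e} at prevalence {sample_prevalence:.1%}")
```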
Here’s the current plan:
1. Gather wastewater metagenomic sequencing data, mostly by looking at papers that published it in the Sequence Read Archive. I’d love it if we could also include our own data here, but we aren’t far enough along to have much yet.
2. Process the sequencing data (code) to clean it (remove adapters, trim low-quality bases, collapse paired-end reads) and identify the reads (assign them to taxonomic nodes).
3. Gather corresponding estimates for the prevalence of various human viruses in the populations contributing to the metagenomic data. (code)
4. Build and fit a model for relative abundance as a function of prevalence, sequencing method, and the type of organism.
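We haven't settled on a model yet, so as a rough illustration of what the last step of the plan could look like, here's a sketch that fits a straight line on a log-log scale: log10(relative abundance) = intercept + slope * log10(prevalence). Both the model form and the data are my assumptions for this example: the data is synthetic and exactly proportional to prevalence, and the sketch ignores sequencing method and organism type, which a real model would include as additional terms.

```python
import math


def fit_log_log(prevalences, relative_abundances):
    """Ordinary least squares fit of log10(RA) against log10(prevalence).

    Returns (intercept, slope). Inputs must all be strictly positive,
    since zero counts can't be log-transformed.
    """
    xs = [math.log10(p) for p in prevalences]
    ys = [math.log10(r) for r in relative_abundances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_relative_abundance(intercept, slope, prevalence):
    """Relative abundance the fitted line predicts at a given prevalence."""
    return 10 ** (intercept + slope * math.log10(prevalence))


# Synthetic data (not real measurements): RA is exactly 2.3e-5 * prevalence,
# which gives 2.3e-8 at 0.1% prevalence.
synthetic_prevalence = [0.0005, 0.001, 0.002, 0.004]
synthetic_ra = [2.3e-5 * p for p in synthetic_prevalence]

intercept, slope = fit_log_log(synthetic_prevalence, synthetic_ra)
print(f"slope={slope:.3f}")
print(f"predicted RA at 1% prevalence: "
      f"{predict_relative_abundance(intercept, slope, 0.01):.2e}")
```

A slope near 1 would mean relative abundance scales in proportion to prevalence; whether real data looks like that is exactly what we want to find out.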
Overall, this would be a big step toward estimating the feasibility of this kind of detection: the number of reads you need to sequence to see a pathogen a given number of times, and so the cost, should be inversely proportional to its relative abundance.
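Here's the arithmetic behind that inverse relationship as a small sketch. It assumes cost scales with the total number of reads sequenced, and the target of 100 matching reads is a number I picked for illustration, not a detection threshold we've chosen.

```python
import math


def reads_needed(relative_abundance: float, target_pathogen_reads: int) -> int:
    """Total reads to sequence so that, in expectation, target_pathogen_reads
    of them match the pathogen."""
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    if target_pathogen_reads <= 0:
        raise ValueError("target_pathogen_reads must be positive")
    return math.ceil(target_pathogen_reads / relative_abundance)


# Rough point estimate from the earlier post: 2.3e-8 at 0.1% prevalence.
rough_ra = 2.3e-8
hypothetical_target = 100  # illustrative only

total = reads_needed(rough_ra, hypothetical_target)
print(f"about {total:.2e} total reads for {hypothetical_target} expected matches")

# Halving the relative abundance roughly doubles the reads required.
print(reads_needed(rough_ra / 2, hypothetical_target) / total)
```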
We’re reasonably far along on (1) and (2), and if you’re curious you can poke around. That shows the counts for human-infecting viruses across samples. It’s rough (ex: we’re not doing any correction for PCR duplication yet) so don’t take it too seriously and let us know if you see something suspicious. On (3) and (4) things are much earlier: we currently have prevalence estimates for five viruses and I’d like to get at least ten times this many.
(If you’re curious why I haven’t been talking more about writing a book since my post a month ago, this is a lot of it. Right around when I posted that I moved from mostly doing individual work to leading this project, and the opportunity cost of taking time away became much higher. I do still want to write something summarizing the advice I got from people around making a book, though, and it’s possible I’ll come back to the book project.)
This post describes in-progress work at the NAO and covers work from a team including Simon Grimm and Asher Parker-Sartori estimating prevalences, Dan Rice modeling, Will Bradshaw evaluating sequencing methods, and Mike McLaren identifying relevant papers and providing general technical guidance.