Genetic Sequencing of Wastewater: Prevalence to Relative Abundance

Back in September I wrote:

In thinking about how you might identify future pandemics by sequencing wastewater, you might have a goal of raising an alert before some fraction of people were currently infected. What you’re actually able to observe, however, are sequencing reads, several steps removed from infection rates. Can we use covid data to estimate how the fraction of people currently infected with some pathogen might translate into the fraction of wastewater sequencing reads that match the pathogen?

In that post I looked at a single pathogen (SARS-CoV-2) in a single metagenomic sequencing dataset (Rothman et al 2021) and got a very rough point estimate (2.3e-8 relative abundance at 0.1% prevalence). What fraction of sequencing reads might come from a novel pathogen at a given level of prevalence is still a key question, however, and this quarter I’m working with several other people at the NAO to get a better understanding here.

Specifically, we’d like to understand how relative abundance (fraction of sequencing reads matching an organism) varies as a function of prevalence (what fraction of people are currently infected) and organism (ex: since we’re sampling wastewater you’d expect disproportionately more gastrointestinal than blood pathogens).

Here’s the current plan:

1. Gather wastewater metagenomic sequencing data, mostly by looking at papers that deposited it in the Sequence Read Archive. I’d love it if we could also include our own data here, but we aren’t far enough along to have much yet.
2. Process the sequencing data (code) to clean it (remove adapters, trim low-quality bases, collapse paired-end reads) and classify the reads (assign them to taxonomic nodes).
3. Gather corresponding estimates for the prevalence of various human viruses in the populations contributing to the metagenomic data. (code)
4. Build and fit a model for relative abundance as a function of prevalence, sequencing method, and the type of organism.
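
To make the last step concrete, here is a minimal sketch of the simplest version of such a model: a straight-line fit of log relative abundance against log prevalence. This is an illustration only, not the model we are building. It ignores sequencing method and organism type, the function names are hypothetical, and the input numbers are made up to show the mechanics.

```python
import math

def fit_log_linear(prevalences, relative_abundances):
    """Ordinary least squares fit of
    log10(relative abundance) = intercept + slope * log10(prevalence)."""
    xs = [math.log10(p) for p in prevalences]
    ys = [math.log10(ra) for ra in relative_abundances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_relative_abundance(intercept, slope, prevalence):
    """Predicted relative abundance at a given prevalence under the fit."""
    return 10 ** (intercept + slope * math.log10(prevalence))

# Made-up (prevalence, relative abundance) pairs, NOT real measurements.
hypothetical_prevalences = [1e-4, 1e-3, 1e-2]
hypothetical_abundances = [2e-9, 2e-8, 2e-7]

intercept, slope = fit_log_linear(hypothetical_prevalences, hypothetical_abundances)
predicted = predict_relative_abundance(intercept, slope, 1e-3)
print(f"slope={slope:.2f}, predicted RA at 0.1% prevalence={predicted:.1e}")
```

A slope near 1 would mean relative abundance scales in proportion to prevalence. A fuller model would add terms for sequencing method and organism type, and would need to account for noise in both the read counts and the prevalence estimates.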

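Overall, this would be a big step toward estimating the feasibility of this kind of detection: the number of reads you need to sequence to see a given number of pathogen reads scales with one over relative abundance, so cost should be inversely proportional to relative abundance.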
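
As a rough worked example of that inverse relationship, the sketch below converts a relative abundance into the expected number of total reads needed to observe a target number of matching reads. It uses the very rough point estimate from my earlier post (2.3e-8 at 0.1% prevalence). The function names are hypothetical, and the assumption that relative abundance scales linearly with prevalence is exactly the kind of thing this project is meant to test, not an established result.

```python
def reads_needed(relative_abundance, target_reads=1):
    """Expected total reads to sequence in order to see `target_reads`
    reads matching the pathogen, given its relative abundance."""
    return target_reads / relative_abundance

def scale_relative_abundance(ra_ref, prevalence_ref, prevalence):
    """ASSUMPTION: relative abundance is proportional to prevalence."""
    return ra_ref * prevalence / prevalence_ref

RA_REF = 2.3e-8          # rough point estimate from the earlier post
PREVALENCE_REF = 0.001   # 0.1% prevalence

reads_at_ref = reads_needed(RA_REF)
ra_at_tenth = scale_relative_abundance(RA_REF, PREVALENCE_REF, 0.0001)
reads_at_tenth = reads_needed(ra_at_tenth)

print(f"reads for one match at 0.1% prevalence:  {reads_at_ref:.2e}")
print(f"reads for one match at 0.01% prevalence: {reads_at_tenth:.2e}")
```

Under these assumptions you would need on the order of 43 million reads per expected matching read at 0.1% prevalence, and ten times that at 0.01%. If cost per read is roughly constant, cost scales the same way.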

We’re reasonably far along on (1) and (2), and if you’re curious you can poke around the results so far, which show the counts for human-infecting viruses across samples. It’s rough (ex: we’re not doing any correction for PCR duplicates yet) so don’t take it too seriously, and let us know if you see something suspicious. On (3) and (4) things are much earlier: we currently have prevalence estimates for five viruses and I’d like to get at least ten times this many.

(If you’re curious why I haven’t been talking more about writing a book since my post a month ago, this is a lot of it. Right around when I posted that I moved from mostly doing individual work to leading this project, and the opportunity cost of taking time away became much higher. I do still want to write something summarizing the advice I got from people around making a book, though, and it’s possible I’ll come back to the book project.)

This post describes in-progress work at the NAO and covers work from a team including Simon Grimm and Asher Parker-Sartori estimating prevalences, Dan Rice modeling, Will Bradshaw evaluating sequencing methods, and Mike McLaren identifying relevant papers and providing general technical guidance.