Sample Prevalence vs Global Prevalence

Cross-posted from my NAO Notebook. Thanks to Evan Fields and Mike McLaren for editorial feedback on this post.

In “Detecting Genetically Engineered Viruses With Metagenomic Sequencing” we have:

“our best guess is that if this system were deployed at the scale of approximately $1.5M/y it could detect something genetically engineered that shed like SARS-CoV-2 before 0.2% of people in the monitored sewersheds had been infected.”

I want to focus on the last bit: “in the monitored sewersheds”. The idea is, if a system like this is tracking wastewater from New York City, its ability to raise an alert for a new pandemic will depend on how far along that pandemic is in that particular city. This is closely related to another question: what fraction of the global population would have to be infected before it could raise an alert?

There are two main considerations pushing in opposite directions, both based on the observation that the pandemic will be farther along in some places than others:

  • With so many places in the world where a pandemic might start, the chance that it starts in NYC is quite low. To take the example of COVID-19, when the first handful of people were sick they were all in one city in China. Initially, prevalence in monitored sewersheds in other parts of the world will be zero, while global prevalence will be greater than zero. This effect should diminish as the pandemic progresses, but at least in the situations I’m most interested in, with cumulative incidence below 1%, it should remain a significant factor. This pushes prevalence in your sample population to lag prevalence in the global population.

  • NYC is a highly connected city: lots of people travel between there and other parts of the world. Since pandemics spread as people move around, places with many long-distance travelers will generally be infected before places with few. If you were monitoring an isolated sewershed, you’d expect this factor to add further lag to your sample prevalence. If you specifically choose places like NYC, though, we expect the high connectivity to reduce lag relative to global prevalence, and potentially even to let sample prevalence lead it.
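
To make these two effects concrete, here is a toy sketch in Python. It is not the model behind the estimate quoted above, and every number in it is an assumption of mine chosen for illustration: 50 equal-sized cities, one hub where 20% of contacts are with travelers (versus 1% everywhere else), 20% daily growth, and a comparison made on the day global cumulative incidence first reaches 0.1%. The function names (`simulate`, `hub_to_global_ratio`) are hypothetical too.

```python
def simulate(origin, n_cities=50, hub=0, m_hub=0.2, m_other=0.01,
             r=0.2, seed=1e-6, global_threshold=0.001, max_days=1000):
    """Per-city cumulative incidence on the first day that global
    cumulative incidence reaches `global_threshold` (toy model)."""
    # mix[i]: assumed fraction of city i's contacts that are with travelers.
    mix = [m_other] * n_cities
    mix[hub] = m_hub
    # Equal populations (an assumption), so each city's share of the
    # traveler pool is proportional to its mixing rate.
    total_mix = sum(mix)
    weight = [m / total_mix for m in mix]

    x = [0.0] * n_cities          # cumulative incidence per city
    x[origin] = seed
    for _ in range(max_days):
        if sum(x) / n_cities >= global_threshold:
            return x
        pool = sum(w * xi for w, xi in zip(weight, x))
        # Simple SI-style growth: local contacts plus contacts with the pool.
        x = [xi + r * ((1 - m) * xi + m * pool) * (1 - xi)
             for xi, m in zip(x, mix)]
    raise RuntimeError("global threshold not reached within max_days")


def hub_to_global_ratio(origin, hub=0, **kwargs):
    """Hub (sample) prevalence divided by global prevalence at the threshold day."""
    x = simulate(origin, hub=hub, **kwargs)
    return x[hub] / (sum(x) / len(x))


n = 50
ratio_if_hub_is_origin = hub_to_global_ratio(origin=0, n_cities=n)
ratio_if_origin_elsewhere = hub_to_global_ratio(origin=1, n_cities=n)
# Non-hub cities are interchangeable in this toy model, so a uniformly
# random origin gives this weighted average.
expected_ratio = (ratio_if_hub_is_origin
                  + (n - 1) * ratio_if_origin_elsewhere) / n

print("hub is origin:       ", ratio_if_hub_is_origin)
print("origin is elsewhere: ", ratio_if_origin_elsewhere)
print("random origin (mean):", expected_ratio)
```

A ratio above 1 means the hub’s sample prevalence leads global prevalence; below 1 means it lags. The numbers that come out depend heavily on the assumed mixing rates and on the very crude travel model, so I wouldn’t read anything into them beyond the direction of the two effects.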

My guess is that with a single monitored city, even the optimal one (which one is that, even?), your sample prevalence will significantly lag global prevalence in most pandemics, but by carefully choosing a few cities to monitor around the world you can probably get to where it leads global prevalence. But I would love to see some research and modeling on this: qualitative intuitions don’t take us very far. Specifically:

  • How does prevalence at a highly-connected site compare to global prevalence during the beginning of a pandemic?

  • What if you instead are monitoring a collection of highly-connected sites?

  • What does the diminishing returns curve look like for bringing additional sites up? Does it go negative at some point, where you are sampling so many excellent sites that the marginal site is mostly dilutive?

  • If you look at the initial spread of SARS-CoV-2, how much of the variance in when places were infected is explained by how connected they are?

  • What about with data from the spread of influenza and SARS-CoV-2 variants?

  • Are there other major factors aside from connectedness that lead to earlier infection? Can we model how valuable different sites are to sample, in a way that can be combined with how operationally difficult it is to sample in various places?
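
As a rough illustration of what an answer to the diminishing-returns question might look like, here is a toy Python sketch. Everything in it is an assumption on my part, not a result: 30 equal-sized cities whose connectivity falls off as 1/rank, a pandemic equally likely to start in any of them, a pooled sample that weights the k most connected sites equally, and a comparison made when global cumulative incidence reaches 0.1%. The helper names (`incidence_at_threshold`, `mean_ratio_by_site_count`) are hypothetical.

```python
def incidence_at_threshold(origin, mix, r=0.2, seed=1e-6,
                           global_threshold=0.001, max_days=2000):
    """Per-city cumulative incidence on the first day that global
    cumulative incidence reaches `global_threshold` (toy model)."""
    n = len(mix)
    total_mix = sum(mix)
    # Each city's share of the traveler pool, assuming equal populations.
    weight = [m / total_mix for m in mix]
    x = [0.0] * n
    x[origin] = seed
    for _ in range(max_days):
        if sum(x) / n >= global_threshold:
            return x
        pool = sum(w * xi for w, xi in zip(weight, x))
        # SI-style growth from local contacts plus contacts with travelers.
        x = [xi + r * ((1 - m) * xi + m * pool) * (1 - xi)
             for xi, m in zip(x, mix)]
    raise RuntimeError("global threshold not reached within max_days")


def mean_ratio_by_site_count(mix):
    """For k = 1..n, the pooled prevalence of the k most connected sites
    divided by global prevalence, averaged over every possible origin."""
    n = len(mix)
    ranked = sorted(range(n), key=lambda i: mix[i], reverse=True)
    runs = [incidence_at_threshold(origin, mix) for origin in range(n)]
    ratios = []
    for k in range(1, n + 1):
        sites = ranked[:k]
        per_origin = [(sum(x[i] for i in sites) / k) / (sum(x) / n)
                      for x in runs]
        ratios.append(sum(per_origin) / n)
    return ratios


# Assumed connectivity: fraction of contacts with travelers, by city rank.
mix = [0.2 / (rank + 1) for rank in range(30)]
ratios = mean_ratio_by_site_count(mix)
for k in (1, 2, 5, 10, 30):
    print(k, "sites:", round(ratios[k - 1], 3))
```

By construction the ratio is 1 when every city is monitored, since the pooled sample is then the whole population. The interesting part is the shape of the curve on the way there, and that shape is exactly what this toy model can’t be trusted on without real travel and outbreak data.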

If you know of good work on these sorts of modeling questions or are interested in collaborating on them, please get in touch! My work email is jeff at

Crossposted from LessWrong