our best guess is that if this system were deployed at the scale of
approximately $1.5M/y it could detect something genetically engineered
that shed like SARS-CoV-2 before 0.2% of people in the monitored
sewersheds had been infected.
Sample Prevalence vs Global Prevalence
Cross-posted from my NAO Notebook. Thanks to Evan Fields and Mike McLaren for editorial feedback on this post.
The estimate quoted at the top of this post is from Detecting Genetically Engineered Viruses With Metagenomic Sequencing.
I want to focus on the last bit: “in the monitored sewersheds”. The idea is, if a system like this is tracking wastewater from New York City, its ability to raise an alert for a new pandemic will depend on how far along that pandemic is in that particular city. This is closely related to another question: what fraction of the global population would have to be infected before it could raise an alert?
There are two main considerations pushing in opposite directions, both based on the observation that the pandemic will be farther along in some places than others:
1. With so many places in the world where a pandemic might start, the chance that it starts in NYC is quite low. To take the example of COVID-19, when the first handful of people were sick they were all in one city in China. Initially, prevalence in monitored sewersheds in other parts of the world will be zero, while global prevalence will be greater than zero. This effect should diminish as the pandemic progresses, but at least in the <1% cumulative incidence situations I’m most interested in, it should remain a significant factor. This pushes prevalence in your sample population to lag prevalence in the global population.
2. NYC is a highly connected city: lots of people travel between there and other parts of the world. Since pandemics spread as people move around, places with many long-distance travelers will generally be infected before places with few. If you were monitoring an isolated sewershed you’d expect this factor to add to the lag in your sample prevalence, but if you specifically choose places like NYC we’d instead expect the high connectivity to reduce lag relative to global prevalence, and potentially even to lead it.
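To make these two effects concrete, here is a minimal toy sketch of a deterministic metapopulation model. It is an illustration under stated assumptions, not a result: the three cities, their populations, the `connectivity` weights, the `growth` and `travel` rates, and the function names are all hypothetical. The 0.2% alert threshold is the only number taken from the estimate quoted in this post. Because the model tracks fractional people, every city has a tiny nonzero prevalence from the first day, which is one of several ways it is cruder than a real outbreak.

```python
def simulate(populations, connectivity, origin, days,
             growth=0.2, travel=0.01, seed_cases=10.0):
    """Return a list, one entry per day, of per-city cumulative infections.

    All parameters are illustrative assumptions, not fitted values.
    """
    infected = [0.0] * len(populations)
    infected[origin] = seed_cases
    history = [list(infected)]
    for _ in range(days):
        # Prevalence among travelers, weighting each city by how much it travels.
        traveler_prev = (
            sum(c * i for c, i in zip(connectivity, infected))
            / sum(c * p for c, p in zip(connectivity, populations))
        )
        updated = []
        for pop, conn, inf in zip(populations, connectivity, infected):
            mixing = travel * conn  # assumed share of contacts involving travelers
            pressure = growth * ((1 - mixing) * inf / pop + mixing * traveler_prev)
            updated.append(inf + pressure * (pop - inf))
        infected = updated
        history.append(list(infected))
    return history


def global_incidence_at_alert(history, populations, monitored, threshold=0.002):
    """Return (day, global cumulative incidence) for the first day the pooled
    monitored sites reach `threshold`, or None if they never do."""
    total_pop = sum(populations)
    sample_pop = sum(populations[i] for i in monitored)
    for day, infected in enumerate(history):
        sample_prev = sum(infected[i] for i in monitored) / sample_pop
        if sample_prev >= threshold:
            return day, sum(infected) / total_pop
    return None


# Hypothetical cities: where the outbreak starts, a travel hub, an isolated city.
populations = [10e6, 10e6, 10e6]
connectivity = [1.0, 5.0, 0.2]
history = simulate(populations, connectivity, origin=0, days=200)
for name, idx in [("origin", 0), ("hub", 1), ("isolated", 2)]:
    print(name, global_incidence_at_alert(history, populations, [idx]))
```

With these made-up parameters, monitoring the hub should raise an alert earlier than monitoring the isolated city, but still only after global cumulative incidence has passed 0.2%. Monitoring the origin city is the only case where the sample leads. How large these gaps are depends entirely on the assumed rates, which is why fitted models would be needed to say anything quantitative.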
My guess is that with a single monitored city, even the optimal one (which one is that even?), your sample prevalence will significantly lag global prevalence in most pandemics, but by carefully choosing a few cities to monitor around the world you can probably get to where it leads global prevalence. But I would love to see some research and modeling on this: qualitative intuitions don’t take us very far. Specifically:
- How does prevalence at a highly-connected site compare to global prevalence during the beginning of a pandemic?
- What if you instead are monitoring a collection of highly-connected sites?
- What does the diminishing returns curve look like for bringing additional sites up? Does it go negative at some point, where you are sampling so many excellent sites that the marginal site is mostly dilutive?
- If you look at the initial spread of SARS-CoV-2, how much of the variance in when places were infected is explained by how connected they are?
- What about with data from the spread of influenza and SARS-CoV-2 variants?
- Are there other major factors aside from connectedness that lead to earlier infection? Can we model how valuable different sites are to sample, in a way that can be combined with how operationally difficult it is to sample in various places?
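One narrow piece of the diminishing-returns question can be written down directly. The sketch below covers only the dilution side: at a single moment in time, adding a site whose prevalence is below the current pooled average lowers the pooled prevalence. It assumes sites contribute to the pool in proportion to population, and the `snapshot` numbers and function names are hypothetical. It says nothing about the offsetting benefit of covering more places where a pandemic might start, which would need a spread model.

```python
def pooled_prevalence(sites):
    """Population-weighted prevalence across (population, prevalence) pairs.

    Assumes each site contributes to the pooled sample in proportion to its
    population, a simplification of how samples would actually be combined.
    """
    total_pop = sum(pop for pop, _ in sites)
    return sum(pop * prev for pop, prev in sites) / total_pop


def running_pool(sites):
    """Pooled prevalence after adding each site in the order given."""
    return [pooled_prevalence(sites[:k]) for k in range(1, len(sites) + 1)]


# Hypothetical snapshot: (population, prevalence), highest-prevalence site first.
snapshot = [(8e6, 0.004), (9e6, 0.002), (5e6, 0.0005), (3e6, 0.0)]
for k, prev in enumerate(running_pool(snapshot), start=1):
    print(f"{k} site(s): pooled prevalence {prev:.4%}")
```

Because the sites are listed from highest to lowest prevalence, each addition here pulls the pooled figure down. Whether the marginal site is worth adding anyway depends on how often it would be the one that is infected early, which this snapshot cannot capture.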
If you know of good work on these sorts of modeling questions or are interested in collaborating on them, please get in touch! My work email is jeff at securebio.org.