How many viruses do insects have?

A quick story about viruses, numbers, and a really weird coincidence.

Dr. Colin Carlson (Georgetown University)

Photo credit Dr. Anna Fagre


A lot of ecological research starts with a simple question: how many?

In my qualifying exam, legendary thing-counter John Harte asked me “how many tree species are there on Barro Colorado Island?” After I struggled to produce an answer, we spent 20 minutes using math to turn my estimate into the answer to another question: “how many tree species do you estimate there should be in the Amazon?”

I did not answer either question correctly. I have never been to either Barro Colorado Island or the Amazon; I have been very few places, and only occasionally taken note of how many tree species were present. But the math I was studying in grad school helped me scrape by with enough of an answer to pass. Like nearly every other question in macroecology, these questions were about the practice of counting uncountable things - not just a “how many?” but “how do you know?” I find endless fascination in this problem: how we estimate species richness based on limited data, how we should measure our own level of confidence, and how we tell lying-with-numbers apart from real science - most importantly, how we tell the difference between those two when we’re the ones telling the story.

Counting microbes

A few years ago, I started thinking about how we count parasites. We know surprisingly little about parasite diversity: we think they might be the majority of life on Earth, but only because that’s what reasonable assumptions might tell us. With someday-Verena-colleague Dr. Tad Dallas, I started playing with host-parasite networks looking for a new way to solve the problem. When we tried subsampling those networks, we found something odd [1]: the relationship between host and symbiont diversity almost always follows a power law (the relationship between x and y is, roughly, y = x to the power of k).

Power law.jpg

Symbiont diversity follows a rough power law scaling in plant-seed disperser networks (a), plant-mycorrhizal networks (b), plant-pollinator networks (c), and mammal-roundworm networks (d). We still don’t fully understand why, and that’s probably okay. We still don’t fully know how tape works either and it still works okay, right?

Reproduced from Carlson et al. (2019) Nature Ecology and Evolution.
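For anyone who wants to poke at the scaling idea, here is a minimal sketch in Python of fitting a power law and following the curve out to a larger number of hosts. It is an illustration under stated assumptions, not the analysis from the paper: it assumes the form y = c * x^k (the prefactor c is added here so the curve can be fit), it uses ordinary least squares on log-transformed values, the function names are hypothetical, and the data points are made up.

```python
import math


def fit_power_law(hosts, symbionts):
    """Fit log(y) = log(c) + k * log(x) by least squares; return (c, k)."""
    log_x = [math.log(x) for x in hosts]
    log_y = [math.log(y) for y in symbionts]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the log-log regression line is the power-law exponent k.
    k = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y)) / sum(
        (a - mean_x) ** 2 for a in log_x
    )
    # Intercept of the log-log line is log(c).
    c = math.exp(mean_y - k * mean_x)
    return c, k


def extrapolate(c, k, total_hosts):
    """Follow the fitted curve out to a larger number of host species."""
    return c * total_hosts ** k


# Made-up data generated from y = 2 * x^0.7, purely for illustration.
hosts = [10, 50, 100, 500, 1000]
symbionts = [2.0 * h ** 0.7 for h in hosts]

c, k = fit_power_law(hosts, symbionts)
estimate = extrapolate(c, k, 5000)
```

An exponent k below 1 means symbiont diversity grows more slowly than host diversity, which is why this approach tends to give lower totals than a straight-line assumption.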

It turns out that if you extrapolate using these curves, you can estimate parasite diversity based on host diversity. That shortcut works provided you’re willing to tango with three tricky assumptions:

  1. Every host has at least one parasite.

  2. Every host’s full parasite community is recorded in your data.

  3. Yes, it’s actually a power law.

Like any good ecologist, we then immediately looked for a way to break all three assumptions. Usually, we can assume the first is true, but there are definitely cases where it might not be, which would lead to overestimation. This just needs conceptual checking: you could fit a curve between yucca moth and Joshua tree diversity, but if you extrapolated outwards based on the total diversity of all moths, you’d be in danger.

The second one is a bit harder, but if you have one or two really well-sampled hosts, you can multiply your estimates by (true parasite diversity) / (recorded parasite diversity) for those species. This is a fairly liberal correction, and it tends to overestimate. For example, let’s say our host-parasite dataset is nice and detailed but only captures 20% of the known parasites of domestic cats, which we’ve sampled really well. When we multiply the total diversity estimate by 5, we’re saying “in addition to what we have seen, we’re assuming the missing 80% of parasites are out there too, and that they’re all perfectly host specific.” (In practice, that won’t usually be true.)
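The cat example works out like this in a quick Python sketch. The counts are hypothetical (the only thing taken from the example is the 20% figure), and the function name is made up for illustration.

```python
def correction_factor(true_diversity, recorded_diversity):
    """Ratio of true to recorded parasite diversity for a well-sampled host."""
    return true_diversity / recorded_diversity


# Hypothetical counts: the dataset records 20 of 100 known cat parasites (20%).
cat_factor = correction_factor(true_diversity=100, recorded_diversity=20)

# Hypothetical raw estimate from the fitted curve, scaled up by the factor.
raw_estimate = 1000
corrected_estimate = raw_estimate * cat_factor
```

Here `cat_factor` comes out to 5, so every raw estimate from the curve gets multiplied by 5.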

Nematode.jpg

The last one is the tricky one. [2] When we try fitting a curve on a smaller part of the network, we tend again to overestimate, because the power law isn’t quite behaving like a power law should - it’s showing evidence of scale-collapse. (On the right, the same network fit with 10, 25, 50, and 100% of the full network: you get higher diversity estimates with less data.) Soon, we’ll publish some work that gets at why this happens - but for now, just kind of squint at it and hope for the best.

Reproduced from Carlson et al. (2019) Nature Ecology and Evolution.

Each of these three assumptions is a little bit shaky, but together the big biases all tend towards overestimating. That’s good to know, because this method usually produces much lower estimates than if you assume host and parasite diversity scale more like a straight line. We’ve used this method to estimate that there are probably about 300,000 species of parasitic worms in vertebrates, which are still 85-95% undescribed in most of the world. We also found that there are probably only 50,000 mammal viruses, a pretty big revision to the Global Virome Project’s now-ubiquitous estimate of 1.67 million in mammals and waterfowl.

Remember that 1.67 million number for a second.

It’ll come up later.

Why not estimate the diversity of insect viruses?

At its heart, Verena is about using new data with familiar tools to get unexpected answers about how the global virome works. As we understand power law scaling more, and compile better datasets about the vertebrate virome, we’ll be able to broaden that “50,000 mammal viruses” to a better estimate across vertebrates. But what about… everything else?

To estimate insect viral diversity, we need three things:

  1. A dataset compiling known insect-virus associations, to make a network and fit the power law

  2. A total estimate of insect diversity

  3. A single insect species where we’re confident we know the true viral diversity, to get the dataset’s “correction factor” (in the mammal study, we use two species - a macaque and a fruit bat - that have been fully inventoried with metagenomic methods)

As it turns out, the first one fell into our lap in the form of the Ecological Database of the World’s Insect Pathogens. The second is a bit more subjective, but estimates of insect diversity are apparently converging around 6 million or so total species.

The third is a bit trickier, so I took to Twitter, and went with the first [3] response from a Very Smart Person:

eddie.PNG

Then, just a quick jaunt over to NCBI Virus, a quick download of the entirety of GenBank as a .csv, and a little targeted searching to get the total number of Drosophila melanogaster viruses that are recorded in the most comprehensive location. We have all the pieces… let’s get to work:

scalin.jpg

When we fit the power law curve, it is very close to a straight line, as these curves tend to go (power exponent of 0.74) - or, to put it in biological terms, there are a lot of specialist insect viruses. Good - that scans, and means we’re not just picking up the most generalist viruses that always get discovered first. (Or, more plainly: our data isn’t super bad.)

Next, we follow that curve out to the estimated 6 million insects on Earth, and get 166,897.6 estimated insect viruses.

That number is obviously way too low - this is what that correction factor is for. There are thirty clean virus names associated with D. melanogaster in GenBank, and only three clean names in EDWIP - a correction factor of ~10. So if we multiply that out, we estimate that

the ~6 million insect species on Earth should have ~1.67 million virus species
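The arithmetic is easy to check. This Python sketch uses only the numbers quoted in this post (the uncorrected estimate and the two Drosophila counts); the variable names are just for illustration.

```python
# Uncorrected estimate from following the fitted curve out to ~6 million insects.
uncorrected_estimate = 166_897.6

# Clean D. melanogaster virus names in each source, as quoted in the post.
genbank_names = 30
edwip_names = 3

# Correction factor: how much the dataset under-records a well-studied host.
factor = genbank_names / edwip_names

corrected_estimate = uncorrected_estimate * factor
corrected_in_millions = round(corrected_estimate / 1e6, 2)
```

With a factor of 10, the corrected estimate is about 1,668,976 virus species, which rounds to 1.67 million.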

If that number makes something scratch at the back of your brain, it’s because this is the exact same number as the Global Virome Project’s now-outdated viral diversity estimate for mammals and waterfowl.

oh.PNG

What does that mean?

Honestly: nothing.



It’s just a very weird coincidence.



But it is weird, right?

Why is this a blog post?

Because that’s where I left it. During the COVID-19 pandemic, we’ve learned a lot about how much damage One Confident Guy Doing Math can do when he doesn’t have any content knowledge. And even though I’m starting to know a decent amount about viruses, I have absolutely no content knowledge whatsoever about insects. My full range of experience was the time I had one on my face camping in the Sierras. See enclosed:

Insect.jpg

But really - doing this kind of math correctly requires three things: strong theory, good data, and subject area knowledge. Here at the Verena Consortium, we’re working on making #1 and #2 available to everyone. But if you have #3 - if you know about the insect virome and want to help turn this into something we can all be more confident in (and publish) - then drop me an email telling me what I got wrong, and how you would change the analysis. You can get the entire analysis and just pore through it until you find something that needs to change. It could be using a different correction factor than the Drosophila estimate, or using a different dataset of insect viruses - maybe EID2, or maybe you’re sitting on a dataset that hasn’t seen the light of day, which we can help you get out there.

That’s what open science is all about: collaboration, sharing, and becoming less wrong (together).

Notes

[1] Someone else found it first in 2014, using the same dataset we originally did. All credit to them for the discovery. Science is crazy sometimes.

[2] Sometime in the next year or so, we’ll be sharing a much deeper dive into the math, which answers the more formal question: does the number of vertices in a subsampled scale-free bipartite graph follow a scale-free pattern? How does that connect to other macroecological patterns? You’ll love it.

[3] A lot of other smart people responded but what, am I gonna read more than one tweet in a sitting?
