
The βCoV Reservoir Database

Our June 2020 study used machine learning to predict more than 200 bat species that might be undiscovered hosts of betacoronaviruses - the group of viruses that cause SARS, MERS, and COVID-19, plus dozens of potential future threats. We don't want those predictions to be a dead end. We need to know: did we do a good job? Which approaches worked best? Is machine learning accurate enough that virologists could use it to guide future sampling? These ideas - accountability, transparency, and open science - are at the core of VERENA: we think science should be done in daylight. So, in partnership with several field and laboratory teams, we're keeping track of whether our model did a good job. (If you think your lab has samples that could be tested, or you published data we missed, reach out to us!)

Here, we're publicly tracking our model against the real-time discovery of viruses in new bat hosts. Since our study, a total of 24 new hosts of betacoronaviruses have been discovered. Our ensemble correctly identified 15 of them (62.5%); the best-performing models so far are Trait-1, which correctly identified 22 of 24 (92%), and Network-1, which correctly identified 15 of 15 in-sample (100%).

Our results are generated by a complex set of machine learning models that are all publicly available on GitHub. For the methods, we suggest reading the preprint. Here's what you need to know to interpret the interface below:

  • Training data: Which bat hosts did we know about when we built the model ensemble? (“Reported” = a known betacoronavirus host as of May 2020; “Unreported” = entirely unsampled or tested negative in the literature.)

  • New data: Which hosts have been discovered since our original study? This includes both (a) records we find in the literature that weren’t found by our GenBank survey; and (b) new records from preprints, published studies, new GenBank accessions, or personal communications.

  • Ensemble: There are eight models in our study (three trait-based, four network-based, and one hybrid approach). The ensemble combines their predictions into one scaled prediction. Each of those eight models, and the ensemble, give a “TRUE” or “FALSE” predicted value. For species known to be betacoronavirus hosts, those are either “true positives” or “false negatives”; for the remainder, species are either “Suspected” or “Not likely.”

  • New predictions: Our predictions re-run with training data + new data (coming soon)
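To make the "Ensemble" idea above concrete, here is a minimal illustrative sketch (not the actual Verena pipeline; the model names, data, and the 0.5 cutoff are hypothetical) of how eight models' continuous scores could be rescaled to a comparable range, averaged into one scaled ensemble prediction, and thresholded to TRUE/FALSE:

```python
import numpy as np

# Hypothetical example: 6 bat species scored by 8 models
# (3 trait-based, 4 network-based, 1 hybrid), one column per model.
rng = np.random.default_rng(1)
n_species, n_models = 6, 8
scores = rng.random((n_species, n_models))

# Rescale each model's scores to ranks in [0, 1] so that models with
# different score scales are comparable, then average across models.
ranks = scores.argsort(axis=0).argsort(axis=0) / (n_species - 1)
ensemble = ranks.mean(axis=1)

# Each model - and the ensemble - is ultimately thresholded to a
# TRUE/FALSE predicted value; 0.5 here is an arbitrary illustrative cutoff.
predictions = ensemble >= 0.5
print(ensemble.round(2), predictions)
```

For a species already known to host betacoronaviruses, a TRUE is a true positive and a FALSE is a false negative; for the rest, TRUE maps to "Suspected" and FALSE to "Not likely."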

Some context to help you decide whether we're doing a good job: predictions are thresholded at a 10% omission rate. Without getting too deep into the stats, that means that if the model is working right, we should expect it to miss about 10% of known hosts and about 10% of unknown hosts - but catch the other 90%.
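The omission threshold works roughly like this (an illustrative sketch with simulated scores, not our actual code or data): pick the score cutoff below which 10% of known hosts fall, so that about 90% of known hosts are predicted TRUE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated model scores: known betacoronavirus hosts tend to score
# higher than unsampled or test-negative species. (Hypothetical data.)
known_host_scores = rng.beta(5, 2, size=100)
other_scores = rng.beta(2, 5, size=900)

# A 10% omission threshold is the 10th percentile of known-host scores:
# ~90% of known hosts score at or above it.
threshold = np.quantile(known_host_scores, 0.10)

predicted_known = known_host_scores >= threshold  # true positives vs. false negatives
predicted_other = other_scores >= threshold       # "Suspected" vs. "Not likely"

sensitivity = predicted_known.mean()
print(f"threshold = {threshold:.3f}, sensitivity = {sensitivity:.2f}")
```

By construction the model catches about 90% of known hosts; how many "Suspected" species it flags among the rest depends on how well the scores separate the two groups.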