
Researchers develop website to map regional SARS-CoV-2 clusters in real time

In a recent study published on the medRxiv* preprint server, a team of researchers developed a phylogenetics-based website to quickly and efficiently identify new introductions and transmission clusters of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in a region.

Study: Identify regional SARS-CoV-2 introductions and transmission clusters in real time. Image Credit: Dotted Yeti / Shutterstock

Global SARS-CoV-2 sequencing efforts have been held back by a lack of advanced phylogenetic and analytical tools. Existing methods of phylogenetic analysis could only handle small, static datasets. Furthermore, they were too computationally expensive to identify clusters of closely related samples in the ever-expanding datasets of densely sampled pathogens, including SARS-CoV-2.

Even when results were available, these analyses were not easily interpretable for an effective public health response due to a lack of intuitive data visualization and exploration tools. Overall, there is an unmet need for high-throughput tools that can quickly interpret available data and so enable public officials to take well-informed public health action.

About the study

The regional index (C) was central to the phylogenetically informed summary heuristic developed for the study. It is a weighted summary of the composition of the descendants of a node in a phylogenetic tree, roughly indicating whether the virus represented by that node was inside or outside a specific region.

When a descendant leaf is genetically identical to the internal node and is inside the specific region, C is one; when such a leaf is instead outside the region, C is zero. The researchers applied additional rules to handle cases where C was undefined. The index calculation does not apply to leaf nodes, for which precise geographic location metadata is not available.
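
To make the idea of a regional index more concrete, here is a minimal Python sketch. It is not the study's formula, which the article does not give. The `Node` class, the `regional_index` function, and the inverse-distance weighting used for the in-between cases are all assumptions made for illustration. Only the two boundary cases (an identical descendant leaf inside or outside the region) follow the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; not the data structure used in the study."""
    name: str
    branch_len: int = 0                # mutations separating the node from its parent
    in_region: Optional[bool] = None   # set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node, dist=0):
    """Yield (mutational distance from the start node, in_region) per descendant leaf."""
    if not node.children:
        yield dist, node.in_region
        return
    for child in node.children:
        yield from leaf_distances(child, dist + child.branch_len)


def regional_index(node):
    """Illustrative index in [0, 1]; returns None where it is not defined."""
    if not node.children:
        return None  # the index is not calculated for leaf nodes
    leaves = [(d, r) for d, r in leaf_distances(node) if r is not None]
    if not leaves:
        return None  # no usable location metadata below this node
    identical = [r for d, r in leaves if d == 0]
    if any(identical):
        return 1.0   # an identical descendant leaf lies inside the region
    if identical:
        return 0.0   # identical descendant leaves exist, all outside the region
    # ASSUMPTION: weight the remaining leaves by inverse mutational distance.
    w_in = sum(1 / d for d, r in leaves if r)
    w_all = sum(1 / d for d, r in leaves)
    return w_in / w_all


# A node with one in-region leaf 1 mutation away and one out-of-region leaf 2 away.
example = Node("n", children=[Node("a", 1, True), Node("b", 2, False)])
print(round(regional_index(example), 3))  # 0.667
```

In this sketch, closer leaves count for more, so a node whose nearest descendants are sampled inside the region scores closer to one.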

Using this method, the researchers traced SARS-CoV-2 transmission clusters in 102 countries using the global parsimony phylogenetic tree, constructed from the 5,563,847 SARS-CoV-2 sequences available on GISAID, GenBank and COG-UK as of November 28, 2021. The distribution of cluster sizes appeared highly skewed, with approximately 20% of distinct regional clusters containing 89% of the samples, suggesting that new viral introductions do not necessarily lead to the establishment of a new locally circulating strain.


More than 50% of the samples in the genomic sequence repositories were from the US or the UK, which severely restricted the analysis of global transmission, as inference of a cluster's origin depends on the robustness of the original sequencing. Therefore, the researchers focused on US data, where sequencing in each state was relatively complete and robust, and detailed state-level metadata was available for most samples.

As of November 2021, more than 300,000 separate state-level clusters of SARS-CoV-2 infections had been identified in the United States since the start of the pandemic. Of these clusters, 84% had an attributed origin and 7% had an international origin, with the majority reflecting transmission within the United States. As one would expect, Mexico and Canada were among the most common regions of international origin, given their long land borders with the US. England was also relatively common because it is well sampled. These results suggest that the sequencing effort in a given region biases the identification of the origin of new clusters.

The most important achievement of this work was the development of Cluster-Tracker, an open source website updated daily. This website assisted in the exploration and prioritization of the latest genomic sequences from across the United States, quickly identifying the clusters most likely to be of interest for public health action. Any user can use this website and its flexible core pipeline to build a similar site for any set of regions (e.g. national level), allowing people to explore SARS-CoV-2 phylogenetic data.


The open-source tools, methodologies and software packages described in the study could prove extremely useful for researchers around the world. Researchers could quickly draw conclusions from large sequence datasets and explore geographic patterns in specific areas of the global SARS-CoV-2 phylogeny to understand the spread of SARS-CoV-2 and even other densely sampled pathogens. Additionally, this analytical approach performed well on simulated data and was consistent with more sophisticated analyses performed during the pandemic.

More importantly, the researchers presented an accessible open-source interactive interface for their results, which could automatically calculate and display introductions and clusters with each update of the global phylogenetic tree.

In summary, this work will allow public health workers to explore the spread of SARS-CoV-2 in the United States and even help public health groups around the world quickly understand and apply the information obtained from the most recent genomic data.

*Important Notice

medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.