4 TMRCAs for SARS-CoV and SARS-CoV-2. 382, 11991207 (2020). Discovery and genetic analysis of novel coronaviruses in least horseshoe bats in southwestern China. Nat Microbiol 5, 14081417 (2020). 6, 8391 (2015). P.L. Over relatively shallow timescales, such differences can primarily be explained by varying selective pressure, with mildly deleterious variants being eliminated more strongly by purifying selection over longer timescales44,45,46. PDF single centre retrospective study Aside from RaTG13, Pangolin-CoV is the most closely related CoV to SARS-CoV-2. We compiled a set of 69SARS-CoV genomes including 58 sampled from humans and 11 sampled from civets and raccoon dogs. Boni, M.F., Lemey, P., Jiang, X. et al. These datasets were subjected to the same recombination masking approach as NRA3 and were characterized by a strong temporal signal (Fig. Ji, W., Wang, W., Zhao, X., Zai, J. Pangolin was developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. Subsequently a bat sarbecovirusRaTG13, sampled from a Rhinolophus affinis horseshoe bat in 2013 in Yunnan Provincewas reported that clusters with SARS-CoV-2 in almost all genomic regions with approximately 96% genome sequence identity2. is funded by the MRC (no. eLife 7, e31257 (2018). Using these breakpoints, the longest putative non-recombining segment (nt1,88521,753) is 9.9kb long, and we call this region NRR2. 26 March 2020. PubMed Central Virological.org http://virological.org/t/ncov-2019-codon-usage-and-reservoir-not-snakes-v2/339 (2020). Stegeman, A. et al. performed recombination analysis for non-recombining regions1 and 2, breakpoint analysis and phylogenetic inference on recombinant segments. The Sichuan (SC2018) virus appears to be a recombinant of northern/central and southern viruses, while the two Zhejiang viruses (CoVZXC21 and CoVZC45) appear to carry a recombinant region from southern or central China. A third approach attempted to minimize the number of regions removed while also minimizing signals of mosaicism and homoplasy. Our third approach involved identifying breakpoints and masking minor recombinant regions (with gaps, which are treated as unobserved characters in probabilistic phylogenetic approaches). 5 Comparisons of GC content across taxa. Mol. Now, the two researchers used genomic sequencing to compare the DNA of the new coronavirus in humans with that in animals and found a 99% match with pangolins. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic, https://doi.org/10.1038/s41564-020-0771-4. 11,12,13,22,28)a signal that suggests recombinationthe divergence patterns in the Sprotein do not show evidence of recombination between the lineage leading to SARS-CoV-2 and known sarbecoviruses. A second breakpoint-conservative approach was conservative with respect to breakpoint identification, but this means that it is accepting of false-negative outcomes in breakpoint inference, resulting in less certainty that a putative NRR truly contains no breakpoints. Extended Data Fig. 36)gives a putative recombination-free alignment that we call non-recombinant alignment3 (NRA3) (see Methods). For weather, science, and COVID-19 . Centre for Genomic Pathogen Surveillance. Pangolin was developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. Two other bat viruses (CoVZXC21 and CoVZC45) from Zhejiang Province fall on this lineage as recombinants of the RaTG13/SARS-CoV-2 lineage and the clade of Hong Kong bat viruses sampled between 2005 and 2007 (Fig. PureBasic 53 13 constellations Public Python 42 17 Med. Biol. In the variable-loop region, RaTG13 diverges considerably with the TMRCA, now outside that of SARS-CoV-2 and the Pangolin Guangdong 2019 ancestor, suggesting that RaTG13 has acquired this region from a more divergent and undetected bat lineage. Grey tips correspond to bat viruses, green to pangolin, blue to SARS-CoV and red to SARS-CoV-2. In outbreaks of zoonotic pathogens, identification of the infection source is crucial because this may allow health authorities to separate human populations from the wildlife or domestic animal reservoirs posing the zoonotic risk9,10. Using a third consensus-based approach for identifying recombinant regions in individual sequenceswith six different recombination detection methods in RDP5 (ref. Genetic lineages of SARS-CoV-2 have been emerging and circulating around the world since the beginning of the COVID-19 pandemic. Across a large region of the virus genome, corresponding approximately to ORF1b, it did not cluster with any of the known bat coronaviruses indicating that recombination probably played a role in the evolutionary history of these viruses5,7. & Andersen, K. G. Pandemics: spend on surveillance, not prediction. Use of Genomics to Track Coronavirus Disease Outbreaks, New Zealand We thank A. Chan and A. Irving for helpful comments on the manuscript. & Li, X. Crossspecies transmission of the newly identified coronavirus 2019nCoV. 4 we compare these divergence time estimates to those obtained using the MERS-CoV-centred rate priors for NRR1, NRR2 and NRA3. Nguyen, L.-T., Schmidt, H. A., Von Haeseler, A. 2, bottom) show that SARS-CoV-2 is unlikely to have acquired the variable loop from an ancestor of Pangolin-2019 because these two sequences are approximately 1015% divergent throughout the entire Sprotein (excluding the N-terminal domain). Nature 579, 265269 (2020). Share . Zhang, Y.-Z. If the latter still identified non-negligible recombination signal, we removed additional genomes that were identified as major contributors to the remaining signal. 94, e0012720 (2020). 5. Don't blame pangolins, coronavirus family tree tracing could prove key Mol. Conducting analogous analyses of codon usage bias as Ji et al. We infer time-measured evolutionary histories using a Bayesian phylogenetic approach while incorporating rate priors based on mean MERS-CoV and HCoV-OC43 rates and with standard deviations that allow for more uncertainty than the empirical estimates for both viruses (see Methods). For coronaviruses, however, recombination means that small genomic subregions can have independent origins, identifiable if sufficient sampling has been done in the animal reservoirs that support the endemic circulation, co-infection and recombination that appear to be common. Rambaut, A., Lam, T. T., Carvalho, L. M. & Pybus, O. G. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). This new approach classifies the newly sequenced genome against all the diverse lineages present instead of a representative select sequences. In early January, the aetiological agent of the pneumonia cases was found to be a coronavirus3, subsequently named SARS-CoV-2 by an International Committee on Taxonomy of Viruses (ICTV) Study Group4 and also named hCoV-19 by Wu et al.5. Lu, R. et al. the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Liu, P. et al. Nat. This study provides an integration of existing classifications and describes evolutionary trends of the SARS-CoV . Webster, R. G., Bean, W. J., Gorman, O. T., Chambers, T. M. & Kawaoka, Y. Evolution and ecology of influenza A viruses. Article All custom code used in the manuscript is available at https://github.com/plemey/SARSCoV2origins. J. Virol. A hypothesis of snakes as intermediate hosts of SARS-CoV-2 was posited during the early epidemic phase54, but we found no evidence of this55,56; see Extended Data Fig. Microbes Infect. PLoS Pathog. 4), that region and shorter BFRs were not included in combined putative non-recombinant regions. Nature 538, 193200 (2016). When the first genome sequence of SARS-CoV-2, Wuhan-Hu-1, was released on 10January 2020 (GMT) on Virological.org by a consortium led by Zhang6, it enabled immediate analyses of its ancestry. This is notable because the variable-loop region contains the six key contact residues in the RBD that give SARS-CoV-2 its ACE2-binding specificity27,37. Coronavirus Disease 2019 (COVID-19) Situation Report 51 (World Health Organization, 2020). obtained the genome sequences of 10 SARS-CoV-2 virus strains through nanopore sequencing of nasopharyngeal swabs in Malta and analyzed the assembled genome with pangolin software, and the results showed that these virus strains were assigned to B.1 lineage, indicating that SARS-CoV-2 was widely spread in Europe (Biazzo et al., 2021). ac, Root-to-tip (RtT) divergence as a function of sampling time for the three coronavirus evolutionary histories unfolding over different timescales (HCoV-OC43 (n=37; a) MERS (n=35; b) and SARS (n=69; c)). The variable-loop region in SARS-CoV-2 shows closer identity to the 2019 pangolin coronavirus sequence than to the RaTG13 bat virus, supported by phylogenetic inference (Fig. While there is involvement of other mammalian speciesspecifically pangolins for SARS-CoV-2as a plausible conduit for transmission to humans, there is no evidence that pangolins are facilitating adaptation to humans. Kosakovsky Pond, S. L., Posada, D., Gravenor, M. B., Woelk, C. H. & Frost, S. D. W. Automated phylogenetic detection of recombination using a genetic algorithm. With horseshoe bats currently the most plausible origin of SARS-CoV-2, it is important to consider that sarbecoviruses circulate in a variety of horseshoe bat species with widely overlapping species ranges57. D.L.R. =0.00025. Li, X. et al. The ongoing pandemic spread of a new human coronavirus, SARS-CoV-2, which is associated with severe pneumonia/disease (COVID-19), has resulted in the generation of tens of thousands of virus genome sequences. Lancet 395, 565574 (2020). The S1 protein of Pangolin-CoV is much more closely related to SARS-CoV-2 than to RaTG13. 1c). 874850). B., Weaver, S. & Sergei, L. Evidence of significant natural selection in the evolution of SARS-CoV-2 in bats, not humans. EPI_ISL_410721) and Beijing Institute of Microbiology and Epidemiology (W.-C. Cao, T.T.-Y.L., N. Jia, Y.-W. Zhang, J.-F. Jiang and B.-G. Jiang, nos. Of importance for future spillover events is the appreciation that SARS-CoV-2 has emerged from the same horseshoe bat subgenus that harbours SARS-like coronaviruses. These means are based on the mean rates estimated for MERS-CoV and HCoV-OC43, respectively, while the standard deviations are set ten times higher than empirical values to allow greater prior uncertainty and avoid strong bias (Extended Data Fig. To employ phylogenetic dating methods, recombinant regions of a 68-genome sarbecovirus alignment were removed with three independent methods. By mid-January 2020, the virus was spreading widely within Hubei province and by early March SARS-CoV-2 was declared a pandemic8. In December 2019, a cluster of pneumonia cases epidemiologically linked to an open-air live animal market in the city of Wuhan (Hubei Province), China1,2 led local health officials to issue an epidemiological alert to the Chinese Center for Disease Control and Prevention and the World Health Organizations (WHO) China Country Office. BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. The inset represents divergence time estimates based on NRR1, NRR2 and NRA3. Of the countries that have contributed SARS-CoV-2 data, 30% had genomes of this lineage. PubMed pango-designation Public Repository for suggesting new lineages that should be added to the current scheme Python 968 73 pangolin Public Software package for assigning SARS-CoV-2 genome sequences to global lineages. Decimal years are shown on the x axis for the 1.2 years of SARS sampling in c. d, Mean evolutionary rate estimates plotted against sampling time range for the same three datasets (represented by the same colour as the data points in their respective RtT divergence plots), as well as for the comparable NRA3 using the two different priors for the rate in the Bayesian inference (red points). Dudas, G., Carvalho, L. M., Rambaut, A. Software package for assigning SARS-CoV-2 genome sequences to global lineages. Phylogenetic Assignment of Named Global Outbreak LINeages, The pangolin web app is maintained by the Centre for Genomic Pathogen Surveillance. 206298/Z/17/Z. However, formal testing using marginal likelihood estimation41 does provide some evidence of a temporal signal, albeit with limited log Bayes factor support of 3 (NRR1), 10 (NRR2) and 3 (NRA3); see Supplementary Table 1. Frontiers | Novel Highly Divergent SARS-CoV-2 Lineage With the Spike Combining regions A, B and C and removing the five named sequences gives us putative NRR1, as an alignment of 63sequences. performed recombination and phylogenetic analysis and annotated virus names with geographical and sampling dates. The construction of NRR1 is the most conservative as it is least likely to contain any remaining recombination signals. When viewing the last 7kb of the genome, a clade of viruses from northern China appears to cluster with sequences from southern Chinese provinces but, when inspecting trees from different parts of ORF1ab, the N. China clade is phylogenetically separated from the S. China clade. (2020) with additional (and higher quality) snake coding sequence data and several miscellaneous eukaryotes with low genomic GC content failed to find any meaningful clustering of the SARS-CoV-2 with snake genomes (a). We say that this approach is conservative because sequences and subregions generating recombination signals have been removed, and BFRs were concatenated only when no PI signals could be detected between them. Temporal signal was tested using a recently developed marginal likelihood estimation procedure41 (Supplementary Table 1). Wang, L. et al. Slider with three articles shown per slide. Katoh, K., Asimenos, G. & Toh, H. in Bioinformatics for DNA Sequence Analysis (ed. Next, we (1) collected all breakpoints into a single set, (2) complemented this set to generate a set of non-breakpoints, (3) grouped non-breakpoints into contiguous BFRs and (4) sorted these regions by length. Here, we analyse the evolutionary history of SARS-CoV-2 using available genomic data on sarbecoviruses. Conservatively, we combined the three BFRs >2kb identified above into non-recombining region1 (NRR1). Maclean, O. Green boxplots show the TMRCA estimate for the RaTG13/SARS-CoV-2 lineage and its most closely related pangolin lineage (Guangdong 2019). Dis. Instead, similarity in codon usage metrics between the SARS-CoV-2 and eukaryotes analyzed was correlated with coding sequence GC content of the eukaryote, with more similar codon usage being identified in eukaryotes with low GC content similar to that of the coronavirus (b). Wang, H., Pipes, L. & Nielsen, R. Synonymous mutations and the molecular evolution of SARS-Cov-2 origins. The Artic Network receives funding from the Wellcome Trust through project no. The shaded region corresponds to the Sprotein. It is available as a command line tool and a web application. A new coronavirus associated with human respiratory disease in China. CNN . On first examination this would suggest that that SARS-CoV-2 is a recombinant of an ancestor of Pangolin-2019 and RaTG13, as proposed by others11,22. Because these subclades had different phylogenetic relationships in regionD (Supplementary Fig. Identification of diverse alphacoronaviruses and genomic characterization of a novel severe acute respiratory syndrome-like coronavirus from bats in China. To begin characterizing any ancestral relationships for SARS-CoV-2, NRRs of the genome must be identified so that reliable phylogenetic reconstruction and dating can be performed. Syst. B.W.P. and T.A.C. Because the SARS-CoV-2 S protein has been implicated in past recombination events or possibly convergent evolution12, we specifically investigated several subregions of the Sproteinthe N-terminal domain of S1, the C-terminal domain of S1, the variable-loop region of the C-terminal domain, and S2. Biol. As informative rate priors for the analysis of the sarbecovirus datasets, we used two different normal prior distributions: one with a mean of 0.00078 and s.d. Wu, Y. et al. 1c). Two exceptions can be seen in the relatively close relationship of Hong Kong viruses to those from Zhejiang Province (with two of the latter, CoVZC45 and CoVZXC21, identified as recombinants) and a recombinant virus from Sichuan for which part of the genome (regionB of SC2018 in Fig. Virus Evol. P.L. The time-calibrated phylogeny represents a maximum clade credibility tree inferred for NRR1.