SARS-CoV-2 Genes - MDPI

viruses

Article

Codon Usage and Phenotypic Divergences of SARS-CoV-2 Genes

Maddalena Dilucca 1,2,*,†, Sergio Forcelloni 1,†, Alexandros G. Georgakilas 3,†, Andrea Giansanti 1,4,† and Athanasia Pavlopoulou 5,6,†

1 Physics Department, Sapienza University of Rome, 00185 Rome, Italy; [email protected] (S.F.); [email protected] (A.G.)
2 Liceo Scientifico Statale Augusto Righi, 00187 Rome, Italy
3 DNA Damage Laboratory, Physics Department, School of Applied Mathematical and Physical Sciences, National Technical University of Athens (NTUA), Zografou Campous, 15780 Athens, Greece; [email protected]
4 INFN Roma1 unit, 00185 Rome, Italy
5 Izmir Biomedicine and Genome Center (IBG), 35340 Balcova, Izmir, Turkey; [email protected]
6 Izmir International Biomedicine and Genome Institute, Dokuz Eylül University, 35340 Balcova, Izmir, Turkey
* Correspondence: [email protected]
† These authors contributed equally to this work.

Received: 28 March 2020; Accepted: 27 April 2020; Published: 30 April 2020

Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which first occurred in Wuhan (China) in December of 2019, causes a severe acute respiratory illness with a high mortality rate, and has spread around the world. To gain an understanding of the evolution of the newly emerging SARS-CoV-2, we herein analyzed its codon usage pattern. For this purpose, we compared the codon usage of SARS-CoV-2 with that of other viruses belonging to the subfamily Orthocoronavirinae. We found that SARS-CoV-2 has a high AU content that strongly influences its codon usage, which appears to be better adapted to the human host. We also studied the evolutionary pressures that influence the codon usage of five conserved coronavirus genes encoding the viral replicase, spike, envelope, membrane and nucleocapsid proteins. We found different patterns of both mutational bias and natural selection that affect the codon usage of these genes. Moreover, we show here that the genes of the two integral membrane proteins (matrix and envelope) tend to evolve slowly, accumulating few nucleotide mutations. Conversely, the genes encoding the nucleocapsid (N), viral replicase and spike (S) proteins, although they are regarded as important targets for the development of vaccines and antiviral drugs, tend to evolve faster than the two genes mentioned above. Overall, our results suggest that the higher divergence observed for the latter three genes could represent a significant barrier in the development of antiviral therapeutics against SARS-CoV-2.

Keywords: coronaviruses; SARS-CoV-2; codon usage bias; mutational bias; natural selection; host adaptation

1. Introduction

The name “coronavirus” is derived from the Greek κορώνα, owing to the viruses' typical crown-like shape. The first complete genome of a coronavirus (mouse hepatitis virus, MHV), a positive-sense, single-stranded RNA virus, was reported in 1990 [1]. It belongs to the family Coronaviridae, whose genomes range from 26.4 (ThCoV HKU12) to 31.7 (SW1) kb in length [2], the largest among all known RNA viruses, with G + C contents varying from 32% to 43% [3]. The Orthocoronavirinae sub-family consists of four genera based on their genetic properties: Alphacoronavirus, Betacoronavirus (subdivided in subgroups A, B, C and D), Gammacoronavirus and Deltacoronavirus. Coronaviruses can infect humans and diverse animal species, including swine, cattle, horses, camels, cats, dogs, rodents, birds, bats, rabbits, ferrets, minks, snakes and other wildlife animals.

Viruses 2020, 12, 498; doi:10.3390/v12050498 www.mdpi.com/journal/viruses

In this study, we focused on 30 coronavirus (CoV) genomes: 28 viruses from Woo et al. (2010) [4]; the Middle East respiratory syndrome coronavirus (MERS-CoV), which appeared for the first time in 2012; and the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which broke out in Wuhan (China) in December of 2019. Only seven CoVs have been identified that infect humans. Two coronaviruses that cause relatively mild respiratory symptoms have been known since the 1960s: human CoV-229E (HCoV-229E) and human CoV-OC43 (HCoV-OC43). Human severe acute respiratory syndrome coronavirus (SARSr-CoV) was identified in 2003, and it causes a more severe respiratory syndrome [5]. The human coronavirus NL63 (HCoV-NL63) was first identified in 2004 and causes respiratory symptoms in humans [6]; the fifth member, human CoV-HKU1 (HCoV-HKU1), was described in 2005 [7]. More recently, the pathogenic Middle East respiratory syndrome coronavirus (MERS-CoV) was identified as the sixth human coronavirus [8]. Finally, the present outbreak of a coronavirus-associated acute respiratory disease called coronavirus disease 2019 (COVID-19) is caused by human SARS-CoV-2 infections [9,10].

The newly sequenced SARS-CoV-2 genome contains two large open reading frames (ORFs), ORF1a and ORF1ab, which encode the replicase polyproteins, as well as genes for four structural proteins [11,12]: the spike-surface glycoprotein (protein S), the small envelope protein (protein E), the matrix protein (M) and the nucleocapsid protein (N).

The phenomenon of codon usage bias (CUB) exists in many genomes, including RNA genomes, and it is determined by mutation and selection [13–15]. The non-random selection of synonymous codons is known to vary among species that are potential hosts for viruses [16]. It is therefore important to study patterns of common codon usage in coronaviruses, because CUB can be related to the driving forces that shape the evolution of small RNA viruses. Mutational bias has been considered the major determinant of codon usage variation among RNA viruses [17]. Indeed, RNA viruses show an effective number of codons (ENC) that is quite high (ENC > 45), pointing to fairly random codon usage, whereas the codon adaptation index (CAI) indicates that the viral CUB is consistent with that of the host, as observed in the Equine infectious anemia virus (EIAV) or Zaire ebolavirus (ZEBOV) [18].

The aims of this study were to perform a comprehensive analysis of the nucleotide composition, codon usage and rate of protein divergence of SARS-CoV-2, and to thereby draw inferences regarding its leading evolutionary determinants.

2. Materials and Methods

2.1. Sequence Data Acquisition

The complete coding genomic sequences of 306 isolates of SARS-CoV-2 reported across the world to date were obtained from GISAID (available at https://www.gisaid.org/epiflu-applications/next-hcov-19-app/) and NCBI viral databases, accessed on 17 March 2020. The sequences were then selected according to their geographical distributions, isolation dates and host species.

In this study, we explored 30 CoV genomes: 28 viruses from Woo et al. (2010) [4]; the Middle East respiratory syndrome coronavirus (MERS-CoV); and the severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2). We downloaded the coding sequences of these coronaviruses from the National Center for Biotechnology Information (NCBI) (available at https://www.ncbi.nlm.nih.gov/). For each virus, we investigated the following genes (shown in alphabetical order): E, M, N, RdRP and S.


2.2. Nucleotide Composition Analysis

The nucleotide compositional properties were calculated for the coding sequences of the 30 CoV genomes. These properties comprise the frequencies of occurrence of each nucleotide (A, U, G and C); the AU and GC contents; and the G + C contents at the first (GC1), second (GC2) and third (GC3) codon positions. To calculate these values, we used an in-house Python script. We also calculated the mean G + C content at the first and second codon positions (GC12).
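The in-house script is not reproduced in the paper, so the following is only a minimal sketch of how these quantities could be computed for a single coding sequence. The function name `composition` and the handling of an incomplete trailing codon are assumptions made for illustration.

```python
from collections import Counter


def composition(cds: str) -> dict:
    """Nucleotide frequencies and GC content by codon position for one coding sequence."""
    seq = cds.upper().replace("T", "U")       # use the RNA alphabet of the text
    seq = seq[: len(seq) - len(seq) % 3]      # assumption: drop an incomplete trailing codon
    if not seq:
        raise ValueError("sequence is shorter than one codon")
    counts = Counter(seq)
    freqs = {base: counts[base] / len(seq) for base in "AUGC"}

    def gc_at(position: int) -> float:
        """G + C fraction at one codon position (0, 1 or 2)."""
        bases = seq[position::3]
        return sum(b in "GC" for b in bases) / len(bases)

    gc1, gc2, gc3 = gc_at(0), gc_at(1), gc_at(2)
    return {
        **freqs,
        "AU": freqs["A"] + freqs["U"],
        "GC": freqs["G"] + freqs["C"],
        "GC1": gc1,
        "GC2": gc2,
        "GC3": gc3,
        "GC12": (gc1 + gc2) / 2,              # mean of the first two codon positions
    }
```

Averaging the per-gene values over the genes of a genome would then give genome-level figures of the kind reported in the Results.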

2.3. RSCU

RSCU vectors for all the genomes were computed by using an in-house Python script, following the formula:

$$\mathrm{RSCU}_i = \frac{X_i}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_j} \qquad (1)$$

Here, $X_i$ is the number of occurrences of codon $i$ in a given genome, and the sum in the denominator runs over its $n_i$ synonymous codons. If the RSCU value of a codon is equal to 1, the codon is used exactly as often as expected under equal usage of its synonyms. Codons with RSCU values greater than 1 have positive codon usage bias, while those with values less than 1 have negative codon usage bias [19]. RSCU heat maps were drawn with the CIMminer software [20], which uses Euclidean distances and the average linkage algorithm.
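As an illustration of the RSCU formula, the sketch below computes a 61-component RSCU vector from one coding sequence using the standard genetic code. It is not the authors' script; the names `CODON_TABLE` and `rscu` are hypothetical, and returning NaN for amino acids that do not occur in the sequence is an assumption.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; "*" marks stop codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict:
    """RSCU value of each of the 61 sense codons in one coding sequence."""
    seq = cds.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")

    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        for codon in family:
            # RSCU_i = X_i / ((1 / n_i) * sum_j X_j); assumption: NaN if the amino acid is absent
            values[codon] = counts[codon] * len(family) / total if total else float("nan")
    return values
```

For example, a sequence in which alanine is encoded three times by GCU and once by GCC gives RSCU values of 3 and 1 for these codons and 0 for GCA and GCG.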

2.4. Effective Number of Codons Analysis

ENC estimates how many different codons are effectively used in a coding sequence. In general, ENC ranges from 20 (when each amino acid is encoded by a single codon) to 61 (when all synonymous codons are used on an equal footing). Given a sequence of interest, the computation of ENC starts from $F_\alpha$, a quantity defined for each family $\alpha$ of synonymous codons (one for each amino acid):

$$F_\alpha = \sum_{k=1}^{m_\alpha} \left( \frac{n_{k\alpha}}{n_\alpha} \right)^2 \qquad (2)$$

where $m_\alpha$ is the number of different codons in $\alpha$ (each one appearing $n_{1\alpha}, n_{2\alpha}, \dots, n_{m_\alpha \alpha}$ times in the sequence) and $n_\alpha = \sum_{k=1}^{m_\alpha} n_{k\alpha}$.

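ENC then weights these quantities on a sequence:

$$\mathrm{ENC} = N_s + \frac{K_2 \sum_{\alpha=1}^{K_2} n_\alpha}{\sum_{\alpha=1}^{K_2} (n_\alpha F_\alpha)} + \frac{K_3 \sum_{\alpha=1}^{K_3} n_\alpha}{\sum_{\alpha=1}^{K_3} (n_\alpha F_\alpha)} + \frac{K_4 \sum_{\alpha=1}^{K_4} n_\alpha}{\sum_{\alpha=1}^{K_4} (n_\alpha F_\alpha)} \qquad (3)$$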

where $N_s$ is the number of families with only one codon and $K_m$ is the number of families with degeneracy $m$ (the set of 6 synonymous codons for Leu can be split into one family with degeneracy 2, similar to that of Phe, and one family with degeneracy 4, similar to that, e.g., of Pro). ENC was evaluated by using the implementation in DAMBE 5.0 [21].
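The paper relies on DAMBE for this step; the sketch below is only an illustrative re-implementation of the ENC formula as stated above, not DAMBE's code. Families are keyed by amino acid and the first two bases of the codon, which splits Leu, Ser and Arg into a two-fold and a four-fold family each, so that for the standard genetic code the family counts are 2 (one codon), 12 (two-fold), 1 (three-fold) and 8 (four-fold). The names are hypothetical, and treating a degeneracy class with no observed codons as unbiased is an assumption.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Key = (amino acid, first two bases): six-fold amino acids split into 2 + 4.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[(aa, codon[:2])].append(codon)


def enc(cds: str) -> float:
    """Effective number of codons of one coding sequence."""
    seq = cds.upper().replace("T", "U")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    # N_s: families with a single codon (Met and Trp)
    total = float(sum(1 for fam in FAMILIES.values() if len(fam) == 1))

    for m in (2, 3, 4):
        fams = [fam for fam in FAMILIES.values() if len(fam) == m]
        k_m = len(fams)
        sum_n = 0.0
        sum_nf = 0.0
        for fam in fams:
            n_alpha = sum(counts[c] for c in fam)
            if n_alpha == 0:
                continue                      # family not observed in this sequence
            f_alpha = sum((counts[c] / n_alpha) ** 2 for c in fam)
            sum_n += n_alpha
            sum_nf += n_alpha * f_alpha
        # Assumption: a class with no observed codons contributes as if unbiased (k_m * m).
        total += k_m * sum_n / sum_nf if sum_nf else k_m * m
    return total
```

With every sense codon used equally often the function returns 61, and any preference for particular synonyms lowers the value.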

2.5. Codon Adaptation Index

The codon adaptation index (CAI) [22] was used to quantify the codon usage similarities between the virus and host coding sequences. The principle behind CAI is that codon usage in highly expressed genes can reveal the optimal (i.e., most efficient for translation) codons for each amino acid.


Hence, CAI is calculated based on a reference set of highly expressed genes to assess, for each codon $i$, the relative synonymous codon usage ($\mathrm{RSCU}_i$) and the relative codon adaptiveness ($w_i$):

$$\mathrm{RSCU}_i = \frac{X_i}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_j}; \qquad w_i = \frac{\mathrm{RSCU}_i}{\max_{j=1,\dots,n_i}\{\mathrm{RSCU}_j\}} \qquad (4)$$

In $\mathrm{RSCU}_i$, $X_i$ is the number of occurrences of codon $i$ in the genome, and the sum in the denominator runs over the $n_i$ synonyms of $i$; RSCU thus measures codon usage bias within a family of synonymous codons. The weight $w_i$ is then defined as the usage frequency of codon $i$ relative to that of the optimal codon for the same amino acid (i.e., the one that is most used in a reference set of highly expressed genes). The CAI for a given gene $g$ is calculated as the geometric mean of the relative adaptiveness values of the codons in that gene, which normalizes it to the maximum CAI value possible for a gene with the same amino acid composition:

$$\mathrm{CAI}_g = \left( \prod_{i=1}^{l_g} w_i \right)^{1/l_g}, \qquad (5)$$

where the product runs over the $l_g$ codons belonging to that gene (except the stop codon).

The index ranges from 0 to 1, where a score of 1 represents the tendency of a gene to use the most frequently used synonymous codons in the host. The CAI analysis of these coding sequences was performed using DAMBE 5.0 [21]. The synonymous codon usage data of different hosts (human and other species) were retrieved from the codon usage database (http://www.kazusa.or.jp/codon/).
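The two CAI steps, deriving the weights from a reference set and taking their geometric mean over a gene, can be sketched as follows. This is not DAMBE's implementation: the function names are hypothetical, the reference set is passed as plain coding sequences rather than a codon usage table, and the pseudocount of 0.5 for codons absent from the reference is an assumption introduced so that the logarithm is always defined.

```python
import math
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds: str) -> list:
    """Split a coding sequence into in-frame codons (RNA alphabet)."""
    seq = cds.upper().replace("T", "U")
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def relative_adaptiveness(reference_cds, pseudocount: float = 0.5) -> dict:
    """w_i for each sense codon, from an iterable of reference coding sequences."""
    counts = Counter(c for cds in reference_cds for c in split_codons(cds))
    weights = {}
    for aa in set(CODON_TABLE.values()) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        # Assumption: codons unseen in the reference get a pseudocount so that w_i > 0.
        x = {c: counts[c] or pseudocount for c in family}
        best = max(x.values())
        for c in family:
            # X_i / max_j X_j equals RSCU_i / max_j RSCU_j, because n_i cancels out
            weights[c] = x[c] / best
    return weights


def cai(cds: str, weights: dict) -> float:
    """Geometric mean of w_i over the sense codons of a gene (stop codons skipped)."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("no sense codons found")
    return math.exp(sum(logs) / len(logs))
```

Single-codon amino acids (Met and Trp) always receive a weight of 1 here, so they are counted in $l_g$ as the text describes but do not change the product.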

To study the patterns of codon biases in the coronaviruses, we used Z-score values:

$$Z_v[\mathrm{ENC}] = \frac{\langle \mathrm{ENC} \rangle_{\mathrm{CoV}} - \langle \mathrm{ENC} \rangle_v}{\sigma_v / \sqrt{N_v}}, \qquad (6)$$

where $\langle \mathrm{ENC} \rangle_{\mathrm{CoV}}$ is the average value of the codon bias index in a given coronavirus; $\langle \mathrm{ENC} \rangle_v$ and $\sigma_v$ are the average value of ENC and its standard deviation over the whole set of viruses; and $N_v$ is the number of viruses (we use the standard deviation of the mean when comparing average values). The same Z-score was evaluated for the codon bias index CAI.
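A minimal sketch of this Z-score follows, under the reading that the per-virus average is compared with the mean over all viruses and scaled by the standard error of that mean. The function name `z_score` is hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the paper does not state which estimator was used.

```python
import math
import statistics


def z_score(virus_mean: float, all_virus_means: list) -> float:
    """Z-score of one virus's average index against the set of all viruses."""
    mu = statistics.mean(all_virus_means)
    sigma = statistics.stdev(all_virus_means)   # assumption: sample standard deviation
    standard_error = sigma / math.sqrt(len(all_virus_means))
    return (virus_mean - mu) / standard_error
```

The same function applies unchanged to ENC and to CAI, since only the list of per-virus averages differs.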

2.6. The Similarity Index

The similarity index (SiD) provides a measure of similarity in codon usage between the virus (in our case, SARS-CoV-2) and the host under study. Formally, it is defined as follows:

$$R(a, b) = \frac{\sum_{i=1}^{59} a_i \cdot b_i}{\sqrt{\sum_{i=1}^{59} a_i^2 \cdot \sum_{i=1}^{59} b_i^2}} \qquad (7)$$

$$\mathrm{SiD} = \frac{1 - R(a, b)}{2} \qquad (8)$$

where $a_i$ is the RSCU value of each of the 59 synonymous codons of the SARS-CoV-2 coding sequences, and $b_i$ is the RSCU value of the identical codon of the potential host. $R(a, b)$ is the cosine of the angle between the vectors $a$ and $b$ and therefore quantifies the degree of similarity between the virus and the host in terms of their codon usage patterns. In our analysis, we considered the host species shown in Table 1 by Woo et al. [4]. We also considered snakes and pangolins, because they were previously identified as possible candidates for the novel coronavirus spillover into humans [9]. SiD values range from 0 to 1. Specifically, the higher the value of SiD, the more adapted the codon usage of SARS-CoV-2 to the host [23].
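Equations (7) and (8) translate directly into a few lines of code. The sketch below takes two RSCU vectors given in the same codon order; the function name `similarity_index` is hypothetical, and the vectors in the usage note are toy values rather than data from the paper.

```python
import math


def similarity_index(a, b) -> float:
    """SiD = (1 - R(a, b)) / 2, with R the cosine of the angle between a and b."""
    if len(a) != len(b):
        raise ValueError("RSCU vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        raise ValueError("an all-zero vector has no direction")
    r = dot / norm
    return (1.0 - r) / 2.0
```

As written, two identical vectors give R = 1 and therefore a value of 0, and two orthogonal vectors give 0.5.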


2.7. ENC Plot

ENC-plot analysis was performed to estimate the relative contributions of mutational bias and natural selection in shaping the CUB of genes encoding proteins that are crucial for SARS-CoV-2: RdRP, the spike-surface glycoprotein (protein S), the small envelope protein (protein E), the matrix protein (M) and the nucleocapsid protein (N). The ENC-plot is a plot in which ENC is the ordinate and the GC content in the third codon position (GC3) is the abscissa. Depending on the action of mutational bias and natural selection, different cases are discernible. If a gene is not subject to selection, a clear relationship is expected between ENC and GC3 [24]:

$$\mathrm{ENC} = 2 + s + \frac{29}{s^2 + (1 - s)^2} \qquad (9)$$

where $s$ represents the value of GC3 [24]. Genes whose codon preference is determined only by mutational bias are expected to lie on or just below Wright's theoretical curve. Alternatively, if a particular gene is subject to selection, it falls well below Wright's theoretical curve. In this case, the vertical distance between the point and the theoretical curve provides an estimate of the relative extent to which natural selection and mutational bias affect CUB.

To evaluate the scattering of the points from Wright's theoretical curve, we calculated the absolute value of this distance, and the box plots were drawn with an in-house Python script.
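Wright's curve and the distance of a gene from it can be sketched as below. The function names are hypothetical, and GC3 is assumed to be given as a fraction between 0 and 1 rather than a percentage.

```python
def wright_enc(gc3: float) -> float:
    """ENC expected from GC3 alone (Equation (9)); gc3 is a fraction in [0, 1]."""
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


def distance_from_curve(enc_observed: float, gc3: float) -> float:
    """Absolute vertical distance between a gene and Wright's theoretical curve."""
    return abs(wright_enc(gc3) - enc_observed)
```

The curve peaks at GC3 = 0.5, where the expected ENC is 60.5, and falls to 31 and 32 at GC3 = 0 and GC3 = 1, respectively.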

2.8. Neutrality Plot

We performed neutrality plot analysis [25] to estimate the relative contributions of natural selection and mutational bias in shaping the CUBs of five coronavirus genes that are key targets in research aiming to develop a vaccine against SARS-CoV-2: M, N, S, RdRP and E. In this analysis, the GC1 or GC2 values (ordinate) were plotted against the GC3 values (abscissa), and each gene was represented as a single point on this plane. In this case, the three stop codons (UAA, UAG and UGA) and the three codons for isoleucine (AUU, AUC and AUA) were excluded from the calculation of GC3, and the two single codons for methionine (AUG) and tryptophan (UGG) were excluded from all three (GC1, GC2 and GC3) [25].

For each gene, we separately performed a Spearman correlation analysis of GC1 and GC2 against GC3. If the correlation between GC12 and GC3 is statistically significant, the slope of the regression line provides a measure of the relative extent to which natural selection and mutational bias affect the CUBs of these genes (Sueoka 1999). In particular, if mutational bias is the driving force that shapes the CUB, then the corresponding data points should be distributed along the bisector (slope of unity). On the other hand, if natural selection also affects the codon choice of a family of genes, then the corresponding regression line should diverge from the bisector. Thus, the divergence between the regression line and the bisector quantifies the extent of codon usage preference due to natural selection.
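The two computations behind a neutrality plot, a rank correlation and a regression slope, can be sketched with the standard library alone. This is not the authors' script: the function names are hypothetical, the regression is assumed to be ordinary least squares, and the significance test of the correlation is left out.

```python
import math


def least_squares(x, y):
    """Slope and intercept of the ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least two points")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    if spread == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / spread
```

With GC3 values as `x` and GC12 values as `y`, a slope close to 1 places the genes along the bisector, while a slope close to 0 indicates that GC12 does not follow GC3.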

2.9. Forsdyke Plot

To study the mutational rates of genes M, N, S, RdRP and E, we performed an analysis by using our previously defined Forsdyke plot [26]. Each gene in SARS-CoV-2 (used as a reference) was compared to its orthologous gene in the 30 coronaviruses considered in this analysis. Each pair of orthologous genes is represented by a point in the Forsdyke plot, where protein divergence is correlated with DNA divergence (see Methods in [26] for details). The protein sequences were aligned using Biopython. The DNA sequences were then aligned using the protein alignments as templates.

Then, both DNA and protein divergences were assessed as explained in Methods in [26] by counting the number of mismatches in each pair of aligned sequences. Thus, each point in the Forsdyke plot measures the divergence between pairs of orthologous genes in the two species, as projected along the phenotypic (protein) and nucleotidic (DNA) axes. The first step in each comparison is to compute the regression line of protein vs. DNA sequence divergence in the Forsdyke plot, obtaining values of intercept and slope for each of the genes (i.e., M, N, S, RdRP and E). To test whether the regression parameters associated with each gene are different or not, we followed a protocol established by Dilucca et al., considering a p-value ≤ 0.05.
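The mismatch counting behind each point of the Forsdyke plot can be sketched as follows. The exact procedure is given in [26] and is not reproduced here, so the normalization by the number of compared positions and the skipping of gapped columns are assumptions, and the function names are hypothetical.

```python
def divergence(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of mismatching positions between two aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns with a gap in either sequence are not compared.
    pairs = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if gap not in (x, y)
    ]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def forsdyke_point(dna_a: str, dna_b: str, protein_a: str, protein_b: str):
    """(DNA divergence, protein divergence) for one pair of orthologous genes."""
    return divergence(dna_a, dna_b), divergence(protein_a, protein_b)
```

A synonymous substitution moves a point along the DNA axis only, which is why genes under strong protein-level constraint produce regression lines with shallow slopes.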

2.10. Phylogenetic Analysis

To explore the evolutionary relationships among the four genera of coronaviruses, phylogenetic analysis of the full-length genomic sequences of the 30 CoVs listed in Table 1 was performed. The sequences were aligned with ClustalO [27,28]. The resulting multiple sequence alignment was used to build a phylogenetic tree by employing a maximum likelihood (ML) method implemented in the software package MEGA version 10.1 [29]. ModelTest-NG [30] was used to select the best-fit evolutionary model of nucleotide substitution, that is, GTR + G + I. Bootstrap analysis (100 pseudo-replicates) was conducted in order to evaluate the statistical significance of the inferred trees.

3. Results

3.1. Nucleotide Composition

We calculated the nucleotide compositions of the coronavirus genomes under study (see Table 1). Previous results showed that the N gene follows the trend A > U > G > C [12] and that coronavirus RNA genomes are biased towards high AU content and low GC content [31]. In line with that, our results show that A is the most frequent base and that the nucleotide composition follows the trend A > U > G > C. Interestingly, SARS-CoV-2 has a nucleotide composition that is similar to that of the other CoVs but with a different trend, U > A > G > C (see Table 2). The GC content in SARS-CoV-2 is 0.37 ± 0.05.

3.2. All the Sequenced SARS-CoV-2 Genomes Share a Common Codon Usage

We downloaded the protein-coding sequences of SARS-CoV-2 from the GISAID database, and classified each SARS-CoV-2 genome based on the geographic location in which it was sequenced (see tree in Figure A1). For each SARS-CoV-2 genome, we calculated the relative synonymous codon usage (RSCU), in the form of a 61-component vector. The heatmap and the associated clustering of these vectors are shown in Figure A2. We noted that the overall codon usage bias among SARS-CoV-2 strains appears to be similar. Moreover, their associated RSCU vectors did not cluster according to geographic location, thereby confirming the common origin of these genomes. Motivated by these observations, we considered a unique vector to represent the codon usage of SARS-CoV-2 in the following analyses.

3.3. Codon Usage of SARS-CoV-2

We compared the codon usage of SARS-CoV-2 with that of the other coronavirus genomes. For this purpose, we used the RSCU, which is a biologically relevant metric of the distance between the codon usage in the protein-coding sequences of these genomes. The heatmap of the RSCU values associated with the coronaviruses is shown in Figure 1. The RSCU values of the majority of the codons scored between 0 and 3.1 (see legend in Figure 1). Interestingly, the newly identified SARS-CoV-2 Wuhan-Hu-1 coronavirus clusters with the other two human coronaviruses SARSr-CoV and HCoV-229E. Moreover, in this heatmap, HCoV-HKU1 and HCoV-NL63 cluster together, consistent with viral adaptation to their host.

In line with previous observations, we show that the mean CpG relative abundance in the coronavirus genomes is markedly suppressed [32]. Specifically, GGG, GGC, CCG (pyrimidine-CpG) and ACG (purine-CpG) present low frequencies of occurrence, probably due to the relative tRNA abundance of the host. In SARS-CoV-2, the most frequently used codons are CGU (arginine, 2.34 times) and GGU (glycine, 2.42), whereas the least used codons are GGG (glycine) and UCG (serine). Of note, the most frequently used codons for each amino acid end with either U or A [18].


Figure 1. Clustering of the relative synonymous codon usage (RSCU) vectors associated with the 30 coronaviruses. Human coronaviruses are shown in red. The newly identified SARS-CoV-2 coronavirus is closer to HCoV-229E and SARSr-CoV in terms of codon usage, as measured by their RSCU vectors. The heatmap was drawn with the CIMminer software [20], which uses Euclidean distances and the average linkage algorithm.

3.4. The Codon Usage of SARS-CoV-2 in Relation to the Human Host

To measure the codon usage bias in the coronavirus genomes, we used the effective number of codons (ENC) and the codon adaptation index (CAI). For each coronavirus, we calculated the average values of CAI and ENC associated with its genes. Table 3 reports the ENC and CAI values for all the coronaviruses considered in this work. To visually enhance the differences among the codon usage of these coronaviruses, we calculated the Z-score value of each virus with respect to the average values of ENC and CAI calculated for all 30 coronaviruses.

The human coronaviruses show different patterns of codon usage (Figure 2). With the exception of HCoV-OC43, all the human coronaviruses have ENC and CAI values that are significantly different from the average values of ENC and CAI calculated for all 30 coronaviruses (|Z-score| > 3). Specifically, the ENC value associated with SARS-CoV-2 (51.9 ± 2.59) is significantly higher than the average of all coronaviruses (50.09 ± 1.32), indicating that SARS-CoV-2 uses a broader set of synonymous codons in its coding sequences. Moreover, the CAI of SARS-CoV-2 (0.727 ± 0.054) is markedly higher than the average one (0.69 ± 0.024), underscoring that SARS-CoV-2 uses codons that are better adapted to its host. The CAI of SARS-CoV-2 is also significantly higher than the CAI of the other human CoVs in the subfamily, thereby suggesting a greater adaptation to the human host for SARS-CoV-2 compared to the other coronaviruses. Finally, the ENC values of the three most pathogenic HCoVs, which have Z-scores > 3 (SARS-CoV, SARS-CoV-2 and MERS), are on average higher than the ENCs of the other four HCoVs, which instead have Z-scores < −3. This stronger CUB (lower ENC) of the four HCoVs reinforces their strong adaptiveness to humans, as they have been circulating in the population for a long time and are now less pathogenic.

To better clarify the origin of SARS-CoV-2 and its optimization to the human host, we then calculated the average CAI for the SARS-CoV-2 genes by using different reference hosts (Figure 3).

Interestingly, snake and human hosts correspond to the highest values of CAI, indicating that SARS-CoV-2 uses codons that are better optimized to these two organisms. Although our results suggest a possible origin of SARS-CoV-2 from snakes and its spillover into humans [33], previous studies do not support this hypothesis [34,35].


Figure 2. Z-score values. The Z-score is calculated for two codon bias indices: the effective number of codons (ENC) and the codon adaptation index (CAI). CAI values are calculated by considering the hosts specified in Table 3 by Woo et al. [4]. Regarding SARS-CoV-2, we considered a human host. In red, we show the human coronaviruses. Several coronaviruses have codon usage values that differ significantly from the average value of the family (|Z-score| > 3). The statistically significant differences are marked with asterisks. In particular, SARS-CoV-2 genes have average values of CAI and ENC that are higher than the average of all coronaviruses. (*): |Z-score| > 3.

To corroborate this observation, we also calculated the similarity index (SiD) of SARS-CoV-2 for the hosts reported in Figure 3 (see Figure A4). SiD values range from 0 to 1; the higher the value of SiD, the more adapted the codon usage of SARS-CoV-2 to the host [23]. Since recent studies have revealed multiple lineages of Malayan pangolin (Manis javanica) coronavirus that are similar to SARS-CoV-2 [36], we also added this organism to the present analysis. CAI was not calculated for pangolin because its genome is not well-annotated, and the five genes under investigation (M, N, S, E and RdRp) are not available. SiD values range from 0.23 (in rabbit) to 0.78 (in human). Notably, this analysis not only confirms our previous observation (see Figure 3) that SARS-CoV-2 uses codons that are better optimized to snakes (SiD = 0.75) and humans (SiD = 0.78), but reveals the same for pangolins (SiD = 0.76), bats (SiD = 0.70) and rats (SiD = 0.71), which are also possible hosts for SARS-CoV-2 [9].

Figure 3. CAI values of SARS-CoV-2 for different hosts. The horizontal axis shows the 12 eukaryotic species that were considered in the comparisons. The host species are ranked in ascending order. CAI values for snake and human hosts are higher than those for other hosts.


Table 1. Coronaviruses under study. Name, abbreviation, NCBI genome accession code and size (in bp) for each virus are reported.

| Genus | Name | Abbreviation | NCBI Code |
|---|---|---|---|
| alphacoronavirus | Feline infectious peritonitis virus | FIPV | NC_002306.3 |
| alphacoronavirus | Human coronavirus 229E | HCoV-229E | NC_002645.1 |
| alphacoronavirus | Human coronavirus NL63 | HCoV-NL63 | NC_005831.2 |
| alphacoronavirus | Miniopterus bat coronavirus 1A | Mi-BatCoV 1A | NC_010437.1 |
| alphacoronavirus | Miniopterus bat coronavirus 1B | Mi-BatCoV 1B | EU420137.1 |
| alphacoronavirus | Miniopterus bat coronavirus HKU8 | Mi-BatCoV HKU8 | NC_010438.1 |
| alphacoronavirus | Porcine epidemic diarrhea virus | PEDV | NC_003436.1 |
| alphacoronavirus | Porcine respiratory coronavirus | PRCV | DQ811787.1 |
| alphacoronavirus | Rhinolophus bat coronavirus HKU2 | Rh-BatCovHKU2 | NC_009988.1 |
| alphacoronavirus | Scotophilus bat coronavirus 512 | Sc-BatCoV 512 | NC_009657.1 |
| alphacoronavirus | Transmissible gastroenteritis virus | TGEV | NC_038861.1 |
| betacoronavirus | Bovine coronavirus | BCoV | NC_003045.1 |
| betacoronavirus | Equine coronavirus | ECoV | LC061274.1 |
| betacoronavirus | Human coronavirus HKU1 | HCoV-HKU1 | NC_006577.2 |
| betacoronavirus | Human coronavirus OC43 | HCoV-OC43 | NC_006213.1 |
| betacoronavirus | Mouse hepatitis virus | MHV | NC_001846.1 |
| betacoronavirus | Porcine hemagglutinating encephalomyelitis virus | PHEV | DQ011855.1 |
| betacoronavirus | Severe acute respiratory syndrome-related coronavirus 2 | SARS-CoV-2 | NC_045512.2 |
| betacoronavirus | Severe acute respiratory syndrome-related coronavirus | SARSr-CoV | NC_004718.3 |
| betacoronavirus | SARS-related Rhinolophus bat coronavirus HKU3/ | SARSr-Rh-BatCoV HKU3 | NC_009694.1 |
| betacoronavirus | Middle East respiratory syndrome-related coronavirus | MERS-CoV | NC_019843.3 |
| betacoronavirus | Bat coronavirus HKU9-1 | Ro-BatCoV HKU9 | NC_009021.1 |
| betacoronavirus | Pipistrellus bat coronavirus HKU5 | Pi-BatCoV HKU5 | NC_009020.1 |
| betacoronavirus | Tylonycteris bat coronavirus HKU4 | Ty-BatCoV HKU4 | NC_009019.1 |
| gammacoronavirus | Avian infectious bronchitis virus | IBV | NC_001451.1 |
| gammacoronavirus | Beluga whale coronavirus SW1 | SW1 | NC_010646.1 |
| gammacoronavirus | Turkey coronavirus | TCoV | NC_010800.1 |
| deltacoronavirus | Bulbul coronavirus HKU11-934 | BuCoV HKU11 | NC_011547.1 |
| deltacoronavirus | Munia coronavirus HKU13-3514 | MunCoV HKU13 | NC_011550.1 |
| deltacoronavirus | Thrush coronavirus HKU12-600 | ThCoV HKU12 | NC_011549.1 |

Table 2. Statistics of SARS-CoV-2.

| | A | C | G | U |
|---|---|---|---|---|
| ObsN | 12688 | 7693 | 8393 | 13709 |
| Freq. | 0.30 | 0.18 | 0.20 | 0.32 |

3.5. Selective Pressures and Mutational Rates Characterizing Five Conserved Coronavirus Genes

The genome of the newly emerging SARS-CoV-2 consists of a single, positive-stranded RNA, which is approximately 30,000 nucleotides long. The newly sequenced SARS-CoV-2 genome is organized similarly to the other coronavirus genomes. Ceraolo et al. performed a cross-species analysis for all proteins encoded by SARS-CoV-2 (see Figures 3 and 4 in [37]). It encodes polyproteins common to all betacoronaviruses, which are further cleaved into the individual structural proteins E, M, N and S, and the non-structural RdRP [38]. Thus, only five viral genes, classified according to their viral locations, were studied for each virus, because the short length and insufficient codon usage diversity of the other genes might have biased our results.

The corresponding gene products are involved in essential viral functions. Briefly, S proteinregulates viral attachment to the receptor of the target host cell [39]; E protein functions to assemblethe virions and acts as an ion channel [40] M protein plays a role in viral assembly and is involvedin the biosynthesis of new virus particles [41]; N protein forms the ribonucleoprotein complex withthe viral RNA [12]; RdRP catalyzes viral RNA synthesis. For these five proteins the RSCU vectors ineach virus of the dataset are shown in Figures 4 and A5. We showed that SARS-CoV-2 clusters withSARSr-CoV and SARSr-Rh-BatCoV HKU3, only for genes E, M and N, consistent with the inferredphylogeny shown in Figure A3.

Table 3. Codon usage biases of different coronaviruses under study.

Abbr. | ENC | CAI
BCoV | 52.10 ± 2.36 | 0.69 ± 0.04
BuCoV HKU11 | 51.41 ± 1.85 | 0.68 ± 0.04
ECoV | 49.31 ± 4.02 | 0.691 ± 0.02
FIPV | 51.56 ± 1.99 | 0.67 ± 0.048
HCoV-HKU1 | 44.58 ± 7.33 | 0.67 ± 0.02
HCoV-229E | 50.29 ± 3.62 | 0.68 ± 0.02
HCoV-NL63 | 44.67 ± 5.35 | 0.66 ± 0.03
HCoV-OC43 | 49.57 ± 3.66 | 0.692 ± 0.02
IBV | 50.65 ± 2.90 | 0.65 ± 0.05
MERS-CoV | 53.08 ± 2.53 | 0.69 ± 0.03
MHV | 53.62 ± 1.72 | 0.71 ± 0.02
Mi-BatCoV 1A | 48.23 ± 3.81 | 0.68 ± 0.03
Mi-BatCoV 1B | 49.31 ± 4.11 | 0.68 ± 0.03
Mi-BatCoV HKU8 | 50.12 ± 4.14 | 0.70 ± 0.02
MunCoV HKU13 | 53.96 ± 0.86 | 0.69 ± 0.04
PEDV | 52.44 ± 2.153 | 0.68 ± 0.04
PHEV | 51.09 ± 3.553 | 0.68 ± 0.02
Pi-BatCoV HKU5 | 53.91 ± 1.36 | 0.70 ± 0.04
PRCV | 51.27 ± 3.15 | 0.67 ± 0.03
Rh-BatCoV HKU2 | 48.08 ± 4.49 | 0.70 ± 0.02
Ro-BatCoV HKU9 | 50.91 ± 2.31 | 0.68 ± 0.03
SARSr-CoV | 53.64 ± 2.43 | 0.67 ± 0.04
SARS-CoV-2 | 51.98 ± 2.59 | 0.72 ± 0.05
SARSr-Rh-BatCoV HKU3 | 54.30 ± 1.61 | 0.67 ± 0.03
Sc-BatCoV 512 | 52.38 ± 2.63 | 0.68 ± 0.04
SW1 | 50.86 ± 1.791 | 0.70 ± 0.03
TCoV | 51.34 ± 2.31 | 0.66 ± 0.05
TGEV | 51.39 ± 3.27 | 0.67 ± 0.04
ThCoV HKU12 | 51.43 ± 2.83 | 0.68 ± 0.03
Ty-BatCoV HKU4 | 50.37 ± 3.74 | 0.68 ± 0.03

3.6. The ENC Plot Analysis of Individual Genes of SARS-CoV-2

To further investigate which factors account for the low codon usage bias of the coronavirus genes, we analyzed the relationship between the ENC value and the percentage of G or C in the third codon position (GC3s). The ENC-plots obtained for the five genes (M, N, S, E and RdRP) are shown separately together with Wright’s theoretical curve (Figure 5), which gives the ENC expected when codon usage is determined exclusively by GC3s [24]. Thus, if mutational bias, as quantified by GC-content in the generally neutral third codon position, is the main factor in determining the codon usage among these genes, the corresponding point in the ENC-plot should lie on or just below Wright’s curve. In Figure 5, all distributions lie below the theoretical curve, an indication that not only mutational bias but also natural selection plays a non-negligible role in the codon choices in all genes. This is also exemplified by the violin plots in Figure 6, which show the distances between the genes and Wright’s theoretical curve in the ENC-plot.

Genes N, S and RdRP are more scattered below the theoretical curve than genes M and E, implying that in the latter the codon usage patterns are largely consistent with the effects of mutational bias. Interestingly, data points corresponding to the gene N, which is the major viral structural component needed to protect and encapsidate the viral RNA, are clustered more closely around GC3 = 0.5 (see Figure 5). This means that the displacement under Wright’s theoretical curve most likely reflects the selective pressure exerted on this gene. Conversely, all other genes show a displacement towards lower values of GC3-content, thereby corroborating our previously mentioned observation that coronaviruses tend to use codons that end with A and U (see Section 3.3).
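To make the ENC-plot comparison concrete, the sketch below evaluates the standard form of Wright's curve [24], ENC = 2 + s + 29 / (s² + (1 − s)²) with s = GC3s, and the gap between that expectation and an observed ENC value. Measuring the distance vertically (expected minus observed) is an assumption of this sketch; the function names and the example gene are hypothetical.

```python
def wright_expected_enc(gc3s):
    """ENC expected when codon usage is shaped by GC3s alone (Wright's curve)."""
    s = gc3s
    return 2.0 + s + 29.0 / (s * s + (1.0 - s) ** 2)


def distance_from_curve(enc_observed, gc3s):
    """Vertical distance below Wright's curve (positive means below the curve).

    Assumption: the distance is taken along the ENC axis at fixed GC3s.
    """
    return wright_expected_enc(gc3s) - enc_observed


# Hypothetical gene with GC3s = 0.30 and an observed ENC of 48.0.
expected = wright_expected_enc(0.30)
gap = distance_from_curve(48.0, 0.30)
print(f"expected ENC = {expected:.2f}, distance below curve = {gap:.2f}")
```

A gene lying on the curve has a distance of zero; larger positive distances correspond to points further below the curve, which the text reads as a stronger contribution of factors other than mutational bias.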

Figure 4. RSCU vectors of three different coronavirus genes (panels: Gene E, Gene M, Gene N). Heatmaps confirm that the RSCU patterns of the newly identified coronavirus SARS-CoV-2 sequence are more related to those of SARSr-CoV and SARSr-Rh-BatCoV HKU3 for genes E, M and N.

Figure 5. ENC-plots of genes M, N, S, E and RdRP. In these plots, each point corresponds to a single gene. The black-dotted lines in all panels are plots of Wright’s theoretical curve corresponding to codon usage biases (CUBs) that occur merely due to mutational bias (no selective pressure). Red dots represent SARS-CoV-2 genes.

Figure 6. Violin plots of the distances of genes M, N, S, E and RdRP from Wright’s theoretical curve.

3.7. Neutrality Plot of Individual Genes of SARS-CoV-2

A neutrality plot analysis was performed to estimate the role of mutational bias and natural selection in shaping the codon usage patterns of the five genes under investigation. In this plot, the average GC-content in the first and second positions of codons (GC12) is plotted against GC3s, which is considered as a pure mutational parameter. Figure 7 shows the neutrality plots obtained for genes M, N, S, E and RdRP, together with the best-fit lines and their slopes.

The rationale behind these results is as follows: the wider the deviation between the slope of the regression line and the bisector, the stronger the action of selective pressure. All correlations are highly significant (Spearman correlation, R2 analysis, p-value < 0.0001). By comparing the divergences between the regression lines and the bisectors in each panel, we reveal that the codon usage of the five genes considered herein depends on a balance between natural selection and mutational bias.
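The neutrality-plot quantities can be sketched in a few lines of Python. The code below computes GC12 and GC3 for a coding sequence and fits an ordinary least-squares line through a set of (GC3, GC12) points. It is an illustrative sketch only: the function names are hypothetical, GC3 is computed over all codons (a simplification of GC3s, which is usually restricted to synonymous third positions), and the example points are made up.

```python
def gc_by_position(cds):
    """Return (GC12, GC3) for an in-frame coding sequence (DNA or RNA letters)."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("sequence contains no complete codon")
    gc = [0, 0, 0]  # G/C counts at codon positions 1, 2 and 3
    for i in range(usable):
        if seq[i] in "GC":
            gc[i % 3] += 1
    n_codons = usable // 3
    gc12 = (gc[0] + gc[1]) / (2 * n_codons)
    gc3 = gc[2] / n_codons
    return gc12, gc3


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up (GC3, GC12) points standing in for one gene across several viruses.
gc3_values = [0.20, 0.30, 0.40]
gc12_values = [0.40, 0.42, 0.44]
slope, intercept = linear_fit(gc3_values, gc12_values)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A slope close to 1 means the fitted line runs parallel to the bisector (GC12 tracks GC3, as expected under mutational bias alone), while a slope well below 1 is the deviation that the text attributes to selective pressure.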

Figure 7. Neutrality plot of genes M, N, S, E and RdRP. In these plots, each point corresponds to a single gene in a virus. The solid black lines in all panels are the bisectors corresponding to those CUBs occurring merely due to mutational bias (no selective pressure). The black-dotted lines are the linear regressions. Red dots represent SARS-CoV-2 genes.

Specifically, in line with the ENC-plot analyses, the genes S and RdRP present the largest deviations of their regression lines from the bisector lines, thereby indicating a stronger action of natural selection. Conversely, the regression line for the gene M is closer to the bisector than those of the other genes, meaning that this gene is the least subject to the action of natural selection. Finally, the genes E and N are intermediate between the previous cases.

Notably, almost all data points are clustered below the bisector lines, implying a selective tendency for a higher AU content in the first two codon positions than in the third one. Additionally, both GC3 and GC12 are lower than 0.5, reflecting a general preference for A and U bases in all three codon positions. Interestingly, data points associated with genes M and E are closer to the bisector lines compared to genes N, S, and RdRP. Based on this observation, we could suggest that the GC content in the first two codon positions tends to be in proportion to GC3 in genes M and E, and this partially explains the closeness of these two genes to the Wright theoretical curve in Figure 5.

3.8. Forsdyke Plot of Individual Genes of SARS-CoV-2

We analyzed the DNA divergence and protein sequence divergence that characterize these five genes by comparing the nucleotide sequences of the newly emerging SARS-CoV-2 and their corresponding protein sequences with those of the other coronaviruses under study. Each SARS-CoV-2 gene was compared to its orthologous gene in the 30 coronaviruses to estimate evolutionary divergences. Each pair of orthologous genes is represented by a point in the Forsdyke plot [26], where protein divergence is plotted against DNA divergence. Each point in the Forsdyke plots measures the divergence between pairs of orthologous genes in the two species, as projected along the phenotypic (protein) and nucleotide (DNA) axes. Thus, the slope is an estimate of the fraction of DNA mutations that result in amino acid substitutions [26]. In Figure 8, a separate Forsdyke plot is shown for each gene.

Overall, protein and DNA sequence divergences are linearly correlated, and these correlations are summarized by the slopes and intercepts of the regression lines.
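The coordinates of a single point in a Forsdyke plot can be illustrated as follows. This sketch assumes that divergence is measured as the percentage of mismatched positions between two pre-aligned sequences, skipping gapped columns; the exact divergence measure of [26] is not reproduced here, and the function name and toy sequences are hypothetical.

```python
def percent_divergence(aligned_a, aligned_b, gap="-"):
    """Percentage of differing positions between two aligned sequences.

    Assumption: columns containing a gap in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
             if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    mismatches = sum(1 for x, y in pairs if x != y)
    return 100.0 * mismatches / len(pairs)


# Toy orthologous pair: one synonymous change (GCU -> GCC, both alanine).
dna_div = percent_divergence("AUGGCUAAA", "AUGGCCAAA")
protein_div = percent_divergence("MAK", "MAK")
point = (dna_div, protein_div)  # (x, y) coordinates in the Forsdyke plot
print(point)
```

In this toy case the nucleotide sequences diverge while the proteins do not, so the point sits on the x-axis; nucleotide changes that alter amino acids move points upward and contribute to a steeper regression line.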

Genes M and E display quite low slopes, indicating that these proteins tend to evolve slowly by accumulating nucleotide mutations on their corresponding genes. Conversely, the steeper slopes for genes N, RdRP and S suggest that these genes tend to evolve faster compared to the other ones. A plausible explanation for this observation is that protein N, due to its immunogenicity, has been frequently used to generate specific antibodies against various animal coronaviruses, including SARS [42]. The viral replicase polyprotein is essential for the replication of viral RNA, and finally, gene S encodes the protein that is responsible for the "spikes" present on the surface of coronaviruses. Our results suggest that the higher divergence observed in these three proteins could represent a major obstacle to the development of a therapeutic treatment against SARS-CoV-2.

Figure 8. Forsdyke plots of genes M, N, S, E and RdRP. Phenotype (Protein div) vs. nucleotide (DNA div) sequence divergence between SARS-CoV-2 and orthologous genes in the other coronaviruses. Each point corresponds to an individual gene. In each panel, the best-fit line is shown in red, together with the associated values of the slope (m) and the intercept (q) in the legend.

4. Discussion

To investigate the factors determining the codon usage patterns of SARS-CoV-2 and other coronaviruses, several analytical methods were used in our study. First, the RSCU values of SARS-CoV-2 were calculated. Despite the relatively high mutation rate that characterizes SARS-CoV-2, as with other RNA viruses, we could not find any significant differences in codon usage between its genome and those of the other CoVs. Moreover, their associated vectors did not cluster based on geographical position, further confirming the common origin of these genomes.

In line with the common nucleotide composition of other RNA viruses such as SARS, our results show that SARS-CoV-2 has a high AU content and a low GC content. The results also indicate that codon usage bias exists and that SARS-CoV-2 prefers U-ending codons. The codon usage bias was further confirmed by a mean ENC value of 51.9 (a value greater than 45 is considered a slight codon usage bias due to mutation pressure or nucleotide compositional constraints). These findings were also corroborated by the CAI analysis, which measures the deviation of a given protein coding gene sequence with respect to a reference set of the most highly expressed genes in the host. This suggests that those RNA viruses with high ENC values (and low CAI) adapt to the host with randomly chosen codons. Therefore, a slightly biased codon usage pattern might allow the virus to use several codons for a respective amino acid, and it might be beneficial for viral replication and translation in host cells.
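For readers who want to see the CAI computation concretely, a minimal sketch follows. It assumes the standard definition from [22], in which each codon receives a relative adaptiveness weight (its count in the reference set divided by the count of the most used synonymous codon) and the CAI of a gene is the geometric mean of these weights. The reference set here is a toy sequence rather than the host gene sets used in this study, the pseudocount of 0.5 for unobserved codons is an assumption of the sketch, and all function names are hypothetical.

```python
import math
from collections import Counter

# Standard genetic code, codons enumerated in U, C, A, G order at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}


def codons(seq):
    """Split an in-frame sequence into codons, using RNA letters."""
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Weight of each codon relative to the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    weights = {}
    # Stop codons and single-codon amino acids (Met, Trp) carry no information.
    for aa in set(CODON_TO_AA.values()) - {"*", "M", "W"}:
        family = [c for c, a in CODON_TO_AA.items() if a == aa]
        freq = {c: counts[c] or pseudocount for c in family}
        top = max(freq.values())
        for c in family:
            weights[c] = freq[c] / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the informative codons in seq."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set: GCU x3, GCC x1, AAA x2, AAG x1.
w = relative_adaptiveness(["GCUGCUGCUGCCAAAAAAAAG"])
print(cai("GCUAAA", w), cai("GCCAAG", w))
```

A sequence built only from the most used codons of the reference set scores 1, and the score drops as less frequent synonymous codons are used.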

We then analyzed in more detail the relationships between SARS-CoV-2 and various possible hosts other than humans. For this purpose, we calculated the average CAI and SiD values of individual SARS-CoV-2 genes against different candidate hosts. Although previous studies do not support transmission of SARS-CoV-2 from snakes to humans [34,35], we showed that SARS-CoV-2 has the highest CAI values by considering these two organisms as references, and therefore, it should use codons that are better optimized to snakes and humans. Moreover, we demonstrated that the adaptiveness of SARS-CoV-2’s codon usage, as measured by SiD, is also fairly high for pangolins, rats, and bats, thereby confirming previous hypotheses regarding the possible origin of SARS-CoV-2 from these species [9].

The ENC-plot analysis indicated that natural selection plays an important role in the codon choice of the five conserved viral genes under study; namely, RdRP, S, E, M and N. However, genes N, S and RdRP are more scattered below the theoretical curve compared to genes M and E, implying that in the latter the codon usage reflects mutational bias more than natural selection. According to the neutrality plot analysis, the genes S and RdRP are subject to the strongest action of natural selection; gene M, whose regression line is closest to the bisector, is the least subject to natural selection; and the genes E and N are in an intermediate situation.

Forsdyke plots were employed to analyze the mutation statuses of these five genes. Proteins M and E were found to have gentler slopes, thereby reflecting a tendency to evolve slowly by accumulating nucleotide mutations on their respective genes. Conversely, the steeper slopes for the three genes N, RdRP and S (the last encoding the protein responsible for the "spikes" present on the surface of coronaviruses) indicate that these three genes, and therefore their corresponding protein products, evolve faster compared to the other two genes.

Interestingly, all x-intercepts (see Table 4) are negative and the degree of negativity correlates with the slope values: the genes with the lowest slopes have the least negative x-intercepts. Recalling that the x-axis (RNA change) can be viewed as a time axis, it appears that the RNA segments encoding M and E are as resistant to change during the early period of genome divergence (negative x values) as they are during the later period of divergence when phenotypic changes can be naturally selected (positive x values). M and E are less flexible at the protein level. On the other hand, the RNA segments encoding S, RdRP and N are flexible during the early genome divergence period (high negative x values). As a result, these segments would have been more able to contribute to the initial genotypic divergence that would have decreased recombination between two genomes diverging in a common cell, thereby facilitating speciation. Under the protection of this global “reproductive isolation”, the segments could then evolve during the period corresponding to positive x values. Without reproductive isolation, blending would have occurred and phenotypic divergence would be less possible.

In future studies, it would be interesting to explore why M and E are less flexible and S, RdRP and N are more flexible towards preventing recombination. Viral RNA recombination requires recognition between two comparable RNA regions and then extensive base pairing, mediated by the kissing stem-loop interaction, to thoroughly examine sequence complementarity. Perhaps the M and E genes lack the ability to form stem-loops, but this inflexibility during phenotypic divergence is suggestive of high conservation.

The findings of the present study could be useful for developing diagnostic reagents and probes for detecting a wide range of viruses and isolates in one test and for vaccine development, utilizing the information about codon usage patterns in these genes.

In addition, an interesting potential idea for the treatment of pneumonia related to SARS-CoV-2 and other similar viruses is a low dose of ionizing radiation (LDIR). SARS-CoV-2 is an RNA virus with an expected mutation rate similar to other RNA viruses, as discussed above. This mutation rate is usually much higher than the corresponding one of any human host. Therefore, as discussed in a recent paper [43], any antiviral drug against SARS-CoV-2 would exert an intense selective pressure on the virus. This may result in highly adaptive and treatment-resistant virus types with enhanced pathogenicity. It should also be taken into consideration that the virus will create a systemic inflammatory response with detrimental effects in the host organism, i.e., acute respiratory distress syndrome (ARDS), a form of severe hypoxemic respiratory failure associated with major inflammatory injury to the lung cells and extravasation of protein-rich edema fluid into the airspace [44,45]. Low dose radiation (<0.5 Gy) has indeed been shown, in some cases, to have anti-inflammatory effects and to modulate the immune response, and has even been suggested for treating pneumonia [46]. This LDIR exposure is not expected to exert significant selective pressure on the new coronavirus. Therefore, and based also on recent suggestions, one can hypothesize that a low dose treatment of 30 to 100 cGy to the lungs of a patient with COVID-19 pneumonia could ameliorate the inflammation significantly and relieve the life-threatening systemic symptoms of the infection [47].

Table 4. Parameters of the linear regressions in Forsdyke plots. None of these plots intersect. The magnitude of each parameter increases from M (lowest) to N (highest). Negative x values indicate flexibility to respond to global pressure to change DNA sequence (virtual time axis) in order to prevent recombination, and thus to allow species divergence (i.e., generate SARS-CoV-2). When recombination is still possible, two diverging genomes in the same cell will blend, so militating against the protein differentiation that would occur in time, corresponding to positive x values.

Gene | Y Intercept | Slope | X Intercept
Matrix (M) | 1.59 | 0.91 | −1.74
Envelope (E) | 9.87 | 1.04 | −9.46
Spike surface (S) | 14.95 | 1.14 | −13.15
RNA replicase (RdRP) | 17.82 | 1.18 | −15.19
Nucleocapsid (N) | 23 | 1.30 | −17.93
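The three columns of Table 4 are linked: for a regression line y = m·x + q, the x-intercept is −q/m. The sketch below recomputes the x-intercepts from the published slopes and y-intercepts; because those values are rounded, the results are expected to match the reported x-intercepts only approximately. The dictionary layout and function name are illustrative.

```python
# (y-intercept q, slope m, reported x-intercept) for each gene, from Table 4.
regressions = {
    "M": (1.59, 0.91, -1.74),
    "E": (9.87, 1.04, -9.46),
    "S": (14.95, 1.14, -13.15),
    "RdRP": (17.82, 1.18, -15.19),
    "N": (23.0, 1.30, -17.93),
}


def x_intercept(q, m):
    """x value at which the line y = m * x + q crosses y = 0."""
    return -q / m


for gene, (q, m, reported) in regressions.items():
    recomputed = x_intercept(q, m)
    print(f"{gene}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```

The ordering is the same in both columns: M has the least negative x-intercept and N the most negative one.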

Author Contributions: Conceptualization, all; methodology, M.D., S.F. and A.P.; software, M.D. and S.F.; validation, all; formal analysis, M.D., S.F. and A.P.; investigation, M.D., S.F. and A.P.; data curation, M.D., S.F. and A.P.; writing, review and editing, all. All authors have read and agreed to the published version of the manuscript.

Funding: This research was partly funded by Sapienza University of Rome, grant RS_PICCOLI_2018_GIANSANTI.

Acknowledgments: The authors warmly thank Donald R. Forsdyke for critically reading the manuscript and for his valuable comments and suggestions. We would also like to thank the referees for their constructive comments.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A

Figure A1. Phylogenetic tree from GISAID.

Figure A2. RSCU vectors of coronavirus. Patterns of RSCU vectors for 306 patients with SARS-CoV-2 from different countries (data downloaded from GISAID).

Figure A3. Unrooted ML-based tree of the 30 CoV genomic sequences. The four distinct color-coded clades correspond to the respective genera of CoVs. The SARS-CoV-2 sequence is indicated by a star. The branch lengths depict evolutionary distance. Bootstrap values higher than 50 are shown at the nodes. The scale bar at the lower left denotes the length of nucleotide substitutions per position.

Figure A4. Similarity index (SiD) of SARS-CoV-2, using different host organisms as references. On the horizontal axis, the 13 eukaryotic species that were considered in the comparisons are shown. The host species are ranked in ascending order. CAI values for the bat, rat, hamster, snake, pangolin, and human are higher compared to the other species.

Figure A5. Heatmaps of RSCU vectors for genes RdRP (upper panel) and S (bottom panel).

References

1. Lai, M.M.C. Coronavirus: Organization, replication and expression of genome. Annu. Rev. Microbiol. 1990, 44, 303–333. [CrossRef] [PubMed]

2. Gorbalenya, A.E.; Enjuanes, L.; Ziebuhr, J.; Snijder, E.J. Nidovirales: Evolving the largest RNA virus genome. Virus Res. 2006, 117, 17–37. [CrossRef] [PubMed]

3. Siddell, S.G.; Ziebuhr, J.; Snijder, E.J. Coronaviruses, Toroviruses, and Arteriviruses. Topley Wilson’s Microbiol. Microb. Infect. 2005. [CrossRef]

4. Woo, P.C.; Huang, Y.; Lau, S.K.; Yuen, K.Y. Coronavirus genomics and bioinformatics analysis. Viruses 2010, 2, 1804–1820. [CrossRef] [PubMed]

5. Fouchier, R.A.; Kuiken, T.; Schutten, M.; Van Amerongen, G.; Van Doornum, G.J.; Van den Hoogen, B.G.; Peiris, M. Aetiology: Koch’s postulates fulfilled for SARS virus. Nature 2003, 423, 240. [CrossRef]

6. Van der Hoek, L.; Pyrc, K.; Jebbink, M.F.; Vermeulen-Oost, W.; Berkhout, R.J.; Wolthers, K.C.; Wertheim-van Dillen, P.M.; Kaandorp, J.; Spaargaren, J.; Berkhout, B. Identification of a new human coronavirus. Nat. Med. 2004, 10, 368–373. [CrossRef]

7. Woo, P.C.; Lau, S.K.; Chu, C.M.; Chan, K.H.; Tsoi, H.W.; Huang, Y.; Wong, B.H.; Poon, R.W.; Cai, J.J.; Luk, W.K.; et al. Characterization and complete genome sequence of a novel coronavirus, coronavirus HKU1, from patients with pneumonia. J. Virol. 2005, 79, 884–895. [CrossRef]

8. Zaki, A.M.; Van, B.S.; Bestebroer, T.M.; Osterhaus, A.D.; Fouchier, R.A. Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. N. Engl. J. Med. 2012, 367, 1814–1820. [CrossRef]

9. Andersen, K.G.; Rambaut, A.; Lipkin, W.I.; Holmes, E.; Garry, R.F. The proximal origin of SARS-CoV-2. Nat. Med. 2020. [CrossRef]

10. Gorbalenya, A.E.; Baker, S.C.; Baric, R.S.; de Groot, R.J.; Drosten, C.; Gulyaev, A.A. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nat. Microbiol. 2020, 5, 536–544. [CrossRef]

11. Forni, D.; Cagliani, R.; Clerici, M.; Sironi, M. Molecular Evolution of Human Coronavirus Genomes. Trends Microbiol. 2017, 25, 35–48. [CrossRef] [PubMed]

12. Sheikh, A.; Al-Taher, A.; Al-Nazawi, M.; Al-Mubarak, A.I.; Kandeel, M. Analysis of preferred codon usage in the coronavirus N genes and their implications for genome evolution and vaccine design. J. Virol. Methods 2020, 277, 113806. [CrossRef] [PubMed]

13. Belalov, I.S.; Lukashev, A.N. Causes and Implications of Codon Usage Bias in RNA Viruses. PLoS ONE 2013, 8, e56642. [CrossRef] [PubMed]

14. Dilucca, M.; Cimini, G.; Giansanti, A. Essentiality, conservation, evolutionary pressure and codon bias in bacterial genomes. Gene 2018, 663, 178–188. [CrossRef] [PubMed]

15. Forcelloni, S.; Giansanti, A. Evolutionary Forces and Codon Bias in Different Flavors of Intrinsic Disorder in the Human Proteome. J. Mol. Evol. 2020, 88, 164–178. [CrossRef]

16. Grantham, R.; Gautier, C.; Gouy, M.; Mercier, R.; Pavé, A. Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 1980, 8, r49–r62. [CrossRef]

17. Jenkins, G.; Holmes, E.C. The extent of codon usage bias in human RNA viruses and its evolutionary origin. Virus Res. 2003, 92, 1–7. [CrossRef]

18. Chen, Y.; Xu, Q.; Yuan, X.; Li, X. Analysis of the codon usage pattern in Middle East Respiratory Syndrome Coronavirus. Oncotarget 2017, 8, 110337–110349. [CrossRef]

19. Sharp, P.M.; Wen-Hsiung, L. An evolutionary perspective on synonymous codon usage in unicellular organisms. J. Mol. Evol. 1986, 24, 28–38. [CrossRef]

20. Weinstein, J.N.; Myers, T.G.; O’Connor, P.M.; Friend, S.H.; Fornace, A.J., Jr.; Kohn, K.W.; Fojo, T.; Bates, S.E.; Rubinstein, L.V.; Anderson, N.L.; et al. An information-intensive approach to the molecular pharmacology of cancer. Science 1997, 275, 343–349. [CrossRef]

21. Xia, X. DAMBE5: A comprehensive software package for data analysis in molecular biology and evolution. Mol. Biol. Evol. 2013, 30, 1720–1728. [CrossRef] [PubMed]

22. Sharp, P.M.; Wen-Hsiung, L. The codon adaptation index - A measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15, 3. [CrossRef] [PubMed]

23. Lia, G.; Wang, H.; Wanga, S.; Xinga, G.; Zhanga, C.; Zhanga, W. Insights into the genetic and host adaptability of emerging porcine circovirus. Virulence 2018, 9, 1301–1313. [CrossRef] [PubMed]

24. Wright, F. The ’effective number of codons’ used in a gene. Gene 1990, 87, 23–29. [CrossRef]

25. Sueoka, N. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 1988, 85, 2653–2657. [CrossRef]

26. Forcelloni, S.; Giansanti, A. Mutations in disordered proteins as early indicators of nucleic acid changes triggering speciation. Sci. Rep. 2020, 10, 4467. [CrossRef]

27. Sievers, F.; Higgins, D.G. Clustal omega. Curr. Protoc. Bioinform. 2014, 48, 3–13. [CrossRef]

28. Sievers, F.; Higgins, D.G. Clustal Omega, accurate alignment of very large numbers of sequences. Methods Mol. Biol. 2014, 1079, 105–116. [CrossRef]

29. Kumar, S.; Stecher, G.; Li, M.; Knyaz, C.; Tamura, K. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Mol. Biol. Evol. 2018, 35, 1547–1549. [CrossRef]

30. Darriba, D.; Posada, D.; Kozlov, A.M.; Stamatakis, A.; Morel, B.; Flouri, T. ModelTest-NG: A New and Scalable Tool for the Selection of DNA and Protein Evolutionary Models. Mol. Biol. Evol. 2020, 37, 291–294. [CrossRef]

31. Berkhout, B.; Van Hemert, F. On the biased nucleotide composition of the human coronavirus RNA genome. Virus Res. 2015, 202, 41–47. [CrossRef]

32. Woo, P.C.; Wong, B.H.; Huang, Y.; Lau, S.K.; Yuen, K.Y. Cytosine deamination and selection of CpG suppressed clones are the two major independent biological forces that shape codon usage bias in coronaviruses. Virology 2007, 369, 431–442. [CrossRef]

33. Ji, W.; Wang, W.; Zhao, X.; Zai, J.; Li, X. Cross-species transmission of the newly identified coronavirus 2019-nCoV. J. Med. Virol. 2020. [CrossRef] [PubMed]

34. Callaway, E.; Cyranoski, D. Why snakes probably aren’t spreading the new China virus. Nature 2020, 577, 1. [CrossRef]

35. Zhang, C.; Zheng, W.; Bell, E.W.; Zhou, X.; Zhang, Y. Protein Structure and Sequence Reanalysis of 2019-nCoV Genome Refutes Snakes as Its Intermediate Host and the Unique Similarity between Its Spike Protein Insertions and HIV-1. J. Proteome Res. 2020, 19, 1351–1360. [CrossRef]

36. Lam, T.T.Y.; Shum, M.H.H.; Zhu, H.C.; Tong, Y.G.; Ni, X.B.; Liao, Y.S.; Wei, W.; Cheung, W.Y.M.; Li, W.J.; Li, L.F.; et al. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature 2020. [CrossRef] [PubMed]

37. Ceraolo, C.; Giorgi, F.M. Genomic variance of the 2019-nCoV coronavirus. J. Med. Virol. 2020. [CrossRef]

38. Woo, P.C.; Lau, Y.; Huang, Y.; Yuen, K.Y. Coronavirus Diversity, Phylogeny and Interspecies Jumping. Exp. Biol. Med. 2009, 234, 1117–1127. [CrossRef]

39. Cavanagh, D. The Coronavirus Surface Glycoprotein. In The Coronaviridae; Springer: Boston, MA, USA, 1995; pp. 73–113. [CrossRef]

40. Ruch, T.; Machamer, C. The coronavirus E protein: assembly and beyond. Viruses 2012, 4, 363–382.

41. Neuman, B.W.; Kiss, G.; Kunding, A.H.; Bhella, D.; Baksh, M.F.; Connelly, S. A structural analysis of M protein in coronavirus assembly and morphology. J. Struct. Biol. 2011, 174, 11–22. [CrossRef]

42. Timani, K.A.; Ye, L.; Ye, L.; Zhu, Y.; Wu, Z.; Gong, Z. Cloning, sequencing, expression, and purification of SARS-associated coronavirus nucleocapsid protein for serodiagnosis of SARS. J. Clin. Virol. 2004, 30, 309–312. [CrossRef]

43. Ghadimi-Moghadam, A.; Haghani, M.; Bevelacqua, J.J.; Jafarzadeh, A.; Kaveh-Ahangar, A.; Mortazavi, S.M.J.; Ghadimi-Moghadam, A.; Mortazavi, S.A.R. COVID-19 Tragic Pandemic: Concerns over Unintentional “Directed Accelerated Evolution” of Novel Coronavirus (SARS-CoV-2) and Introducing a Modified Treatment Method for ARDS. J. Biomed. Phys. Eng. 2020, 10, 241–246. [CrossRef] [PubMed]

44. SeungHye, H.; Rama, K.M. The acute respiratory distress syndrome: from mechanism to translation. J. Immunol. 2015, 194, 855–860, doi:10.4049/jimmunol.1402513.

45. Zhang, W.; Zhao, Y.; Zhang, F.; Wang, Q.; Li, T.; Liu, Z.; Wang, J.; Qin, Y.; Zhang, X.; Yan, X.; et al. The use of anti-inflammatory drugs in the treatment of people with severe coronavirus disease 2019 (COVID-19): The experience of clinical immunologists from China. Clin. Immunol. 2020, 108393. [CrossRef]

46. Calabrese, E.J.; Dhawan, G. How radiotherapy was historically used to treat pneumonia: Could it be useful today? Yale J. Biol. Med. 2013, 86, 555–570. [CrossRef]

47. Kirby, C.; Mackenzie, M. Is low dose radiation therapy a potential treatment for COVID-19 pneumonia? Radiother. Oncol. 2020. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

