ORIGINAL ARTICLE

Mutation prediction and phylogenetic analysis of SARS-CoV2 protein sequences using LSTM based encoder-decoder model

Pages 103-121 | Received 06 Oct 2022, Accepted 03 Mar 2023, Published online: 23 Mar 2023

Abstract

The ongoing evolution and mutation of SARS-CoV2 pose a significant challenge to the development of effective medication, as genetic changes can render previously developed drugs ineffective. To address this issue, researchers are exploring various strategies to predict and assess the emergence of novel SARS-CoV2 strains through phylogenetic analysis and mutation prediction. In recent years, deep learning approaches have been applied to studying viruses, including SARS-CoV2, to improve our understanding of virus evolution, structure, categorization, and prediction. In this study, a novel deep learning approach is proposed to predict and assess SARS-CoV2 protein sequences. Specifically, Long Short-Term Memory (LSTM) is utilized to predict protein sequences from aligned input sequences, with a bioinformatics tool used to detect mutations. The deep learning model proposed in this study exhibits high accuracy in predicting several key SARS-CoV2 protein sequences, including spike, replicase, putative, ORF1a, and nucleocapsid. The study uses genome sequencing data from the National Center for Biotechnology Information (NCBI) and demonstrates a 98% accuracy in predicting genomic sequences, with minimal changes observed in protein sequences. This study represents a significant improvement over previous research, which has focused only on predicting mutations in viral RNA sequences using datasets from other viruses.

1. Introduction

Adenine (A), thymine (T), cytosine (C), and guanine (G) are the four nitrogen-containing nucleobases that make up DNA nucleotides; in RNA, uracil (U) takes the place of thymine. RNA sequences differ from DNA sequences in that they mutate at a higher rate and are less stable (Mohamed, Sayed, Salah, & Houssein, Citation2021). SARS-CoV-2 is spreading rapidly, in part because of the limited accuracy of current detection technologies (Lopez-Rincon et al., Citation2021). SARS-CoV-2 is a typical RNA virus that generates new mutations during its replication cycle, with a usual evolutionary rate of about 10⁻⁴ nucleotide substitutions per site per year (Lu et al., Citation2020). SARS-CoV2 belongs to the Coronaviridae family (Whata & Chimedza, Citation2021), and its identification can be challenging due to mutations. This paper therefore explores mutation detection and sequence prediction using deep learning. Having access to current virus mutations and prior evolution could help researchers better understand virus evolution dynamics and predict future viruses and diseases (Shendure & Ji, Citation2008).

In human disease genetics, the prediction of genetic mutations is a hot topic (Stranger & Dermitzakis, Citation2006). Phylogenetic analysis determines the evolutionary relationships between species and infers their ancestral sequence. These evolutionary connections between RNA sequences can help anticipate which lineages may share the same function (Xu et al., Citation2015).

This paper presents several significant contributions in the field of bioinformatics. First, a novel method is proposed for the alignment of protein sequences to identify mutations and assess the similarity between genomic sequences. This technique employs advanced algorithms for sequence alignment and statistical analysis, enabling accurate and reliable comparisons of protein sequences. Second, an evolution tree is generated for the protein sequences of SARS-CoV2, providing insight into the relationships and origins of different strains of the virus. Third, a Long Short-Term Memory (LSTM) based Encoder-Decoder deep learning model is developed to predict mutations in protein sequences of SARS-CoV2. This model utilizes machine learning algorithms to analyze large datasets of protein sequences and associated mutation data, enabling accurate predictions of specific mutations in the viral genome. Finally, it also presents a method for predicting nucleotide changes and identifying new strains of the virus in the new generation. Overall, these contributions represent significant advances in the study of SARS-CoV2 and provide valuable tools and techniques for understanding the virus’s evolution, pathogenicity, and potential for developing new treatments and vaccines. Hence, the approach taken in this study provides a more comprehensive analysis of the mutations present in SARS-CoV2 protein sequences and has the potential to improve our ability to predict and respond to emerging strains of the virus.

There are several tools available for aligning protein sequences, including:

Clustal Omega: This is a popular online tool for multiple sequence alignment of proteins. You can input up to 500 sequences in FASTA or Clustal format and choose different options for alignment parameters. The output can be visualized as an alignment or a tree (Sievers & Higgins, Citation2018).

MUSCLE: This is another online tool for protein sequence alignment. It allows you to input up to 500 sequences and provides options for alignment parameters, such as the gap opening penalty and the gap extension penalty (Edgar, Citation2004).

T-Coffee: This tool provides a variety of alignment methods and allows you to input multiple sequence formats, including FASTA, EMBL, and UniProt. T-Coffee also allows you to visualize the alignment output in a variety of ways (Taly et al., Citation2011).

BioEdit: It is a popular desktop software tool for sequence alignment, visualization, and analysis. It is widely used by researchers and has many useful features for working with DNA, RNA, and protein sequences. One of the key features of BioEdit is its ability to align multiple sequences using a variety of algorithms, including ClustalW, T-Coffee, and MUSCLE. The software also allows for manual editing of alignments, which can be useful for fine-tuning the alignment or correcting errors. In addition to alignment, BioEdit can be used for a variety of other tasks, such as sequence annotation, primer design, and restriction enzyme analysis. The software also includes visualization tools, such as the ability to generate graphical representations of alignments or sequence features (Hall, Citation1999; Tomita, Mori, & Mochizuki, Citation2015; Carvalho, Fischer, & Chen, Citation2009).

Alternatively, bioinformatics software packages such as MEGA can be used to align multiple sequences, generate phylogenetic trees, and perform other analyses. These packages are typically more powerful and flexible than online tools but require more expertise to use effectively.
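The tools above all rest on scoring matches, mismatches, and gaps between sequences. As a minimal sketch of that idea, the following implements Needleman-Wunsch global pairwise alignment with a linear gap penalty. It is not what Clustal Omega, MUSCLE, T-Coffee, or BioEdit implement internally; the function name, the scoring values, and the short example peptides are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align two residues
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Hypothetical peptides: the second lacks one residue of the first.
aligned_a, aligned_b, best = needleman_wunsch("MFVFLVLLPL", "MFVFVLLPL")
print(aligned_a)
print(aligned_b)
print(best)
```

Production tools use substitution matrices and separate gap opening and gap extension penalties rather than the flat scores assumed here.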

Sequence prediction has been approached with traditional machine learning methods such as Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF), and with deep learning methods such as Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM). An advantage of the deep learning approach is that it permits variable-length sequences as input and output. LSTM has been used extensively in the literature for predicting the genomic sequences of other viruses because its gating mechanisms allow it to capture dependencies across long sequences (Zhou et al., Citation2023).

Predicting mutations in protein sequences is an important task in the field of bioinformatics, and there are several tools and techniques available to do so. One approach is to use in silico methods to predict the impact of mutations on protein structure and function. This can be done using software programs that analyze the effects of amino acid changes on the physical and chemical properties of the protein, such as its stability, solubility, and interactions with other molecules. Another approach is to use machine learning algorithms to predict the likelihood of specific mutations occurring in a given protein sequence. Machine learning models can be trained on large datasets of protein sequences and associated mutation data to learn patterns and make predictions about the likelihood of specific mutations occurring. Overall, predicting mutations in protein sequences is an important area of research for understanding the evolution and pathogenicity of SARS-CoV-2, as well as for developing new treatments and vaccines (Kumar, Stecher, & Tamura, Citation2016).

The rest of the paper is structured as follows. Section 2 reviews previous research. Section 3 details the data sources and methods used in the prediction. Section 4 presents the results of the proposed work. Finally, Section 5 summarizes the work and its future scope.

2. Literature review

SARS-CoV-2 is an RNA virus, and like all RNA viruses, it has a high mutation rate. Mutations are changes in the genetic material (in this case, RNA) of the virus. Mutations can be beneficial, harmful, or neutral to the virus, depending on their effects on the virus’s survival and ability to replicate. There have been many mutations identified in SARS-CoV-2 since the start of the pandemic. Some of these mutations are more significant than others and can affect the behavior of the virus. One particular mutation, known as the D614G mutation, has been associated with increased transmissibility of the virus. Other mutations have been identified in the spike protein of the virus, which is the protein that allows the virus to enter and infect human cells. Some of these mutations may make the virus more infectious or more resistant to antibodies generated by vaccination or previous infection. It’s important to note that not all mutations are necessarily a cause for concern. Many mutations may not affect the behavior of the virus or may even weaken the virus. However, monitoring mutations is an important part of understanding how the virus is evolving and how it may respond to vaccines and treatments. That’s why ongoing genomic surveillance is crucial in tracking the spread and evolution of SARS-CoV-2 (Rambaut et al., Citation2020).

Pathan, Biswas, and Khandaker (Citation2020) investigated the mutation rate of the complete genome sequence of SARS-CoV-2 using patient datasets from various countries. Based on the collected data, specific nucleotide and codon mutations were identified. The mutation rate was analysed in four groups according to dataset size: China, Australia, the United States, and the rest of the world. Although codons have a lower mutation rate than nucleotides, a substantial number of thymine (T) and adenine (A) nucleotides were found to change to other nucleotides in all locations. A Long Short-Term Memory (LSTM) model was used to predict the nucleotide mutation rate of the 400th patient. The mutation rate increases by 0.1 percent when nucleotides change from T to C and G, C to G, and G to T, whereas changes from T to A and A to C lower it by 0.1 percent. Nawaz, Fournier-Viger, Shojaee, and Fujita (Citation2021) explore how COVID-19 genomic sequences can be used to extract meaningful information with artificial intelligence methods. Sequential Pattern Mining (SPM) is first applied to a corpus of machine-readable COVID-19 genome sequences to determine whether any significant hidden patterns, such as recurrent patterns of nucleotide bases and their interactions, can be discovered. Sequence prediction is then applied to the corpus to determine whether nucleotide bases can be anticipated from earlier ones. Finally, an algorithm is developed for genome sequence mutation analysis to identify regions where nucleotide bases change and to determine the mutation rate. The results demonstrate that SPM and mutation analysis techniques can detect intriguing trends in COVID-19 genomic sequences, allowing the evolution and variability of COVID-19 strains to be evaluated.

A few researchers have used a seq2seq LSTM neural network to predict next-generation sequences, treating the sequences as textual data (Mohamed et al., Citation2021). Because one-hot vectors are used as input, the model retains the position of each nucleotide in the sequences. Two RNA virus sequencing datasets were used to test the proposed model, and the findings were promising. The results show how an LSTM neural network for DNA and RNA sequences can be used to handle a variety of bioinformatics sequencing problems. Chen, Gao, Wang, and Wei (Citation2021) examine the mechanism, frequency, and ratio of mutations in the S protein, which is the target of the majority of COVID-19 vaccines and antibody treatments. Fifty-six antibody structures were also generated, and their 2D and 3D properties were studied. Additionally, mutations are predicted to change the binding free energies (BFE) of S protein and antibody or ACE2 complexes. According to this research, which combines genetics, biophysics, deep learning, and algebraic topology, the majority of the 462 mutations on the receptor-binding domain (RBD) weaken the binding of S protein and antibodies, jeopardizing the effectiveness and dependability of antibody treatments and vaccinations. Nguyen et al. (Citation2021) use deep learning approaches to describe and analyse genetic changes in SARS-CoV-2 coding regions, as well as their predicted effects on protein secondary structure and solvent accessibility. The predictions indicate that the highly publicized D614G mutation in the viral spike protein is unlikely to affect the protein’s secondary structure or relative solvent accessibility. Based on 6324 viral genome sequences, the authors created a mutational spreadsheet dataset to support research into SARS-CoV-2 from a variety of angles, particularly in tracing the virus’s evolution and global distribution. The results also demonstrate that E, M, ORF6, ORF7a, ORF7b, and ORF10 are the most stable coding genes, suggesting that these genes may be used to create vaccines and treatments.

The COVID-19 pandemic is ongoing, with new strains carrying unexpected changes. Understanding how to predict virus alterations has important implications for developing vaccines, medications, and prevention strategies. Because the number of reported changes in SARS-CoV-2 is currently limited, building a prediction model on data from a virus with many recorded mutations, such as the influenza A virus, is advantageous and straightforward. In one such study, the likely mutation sites and changed amino acids in hemagglutinins from the Eurasia H1 influenza A virus were predicted using a feedforward neural network trained with backpropagation (Yan & Wu, Citation2021). Wang, Chen, Gao, and Wei (Citation2021) use one of the most comprehensive data sets available, which includes 506,768 SARS-CoV-2 genome sequences, to follow fast-spreading RBD mutations in pandemic-affected countries and investigate their evolutionary tendencies around the world. There were 6945 unique single mutations found on the S protein, with 1024 of them occurring on the RBD. Of the 651 non-degenerate variants on the RBD, 100 were detected more than 28 times in the database and were deemed significant protein sequence alterations. In addition to the N501Y, E484K, and K417N mutations in the UK, South Africa, and Brazil variants and the L452R and E484Q mutations in the India variants, the study highlights the N439K, S477N, S477R, and N501T mutations observed in 31 outbreak countries over the preceding months.

There are many deep learning models for modelling protein sequences and structures. Some of the latest are:

AlphaFold: Developed by DeepMind, AlphaFold uses deep learning to predict the 3D structure of proteins. It won the 2020 CASP14 competition by accurately predicting the structures of 25 out of 43 proteins.

RoseTTAFold: Developed at the University of Washington, RoseTTAFold uses a combination of deep learning and template-based modeling to predict the 3D structure of proteins. It outperformed other methods in the CASP14 competition.

TAPE: Developed at the University of California, Berkeley, TAPE (Tasks Assessing Protein Embeddings) is a set of benchmark tasks and pretrained deep learning models for protein sequences, covering tasks such as secondary structure prediction, contact prediction, and remote homology detection.

ProGen: Developed by Salesforce Research, ProGen is a deep learning language model that generates protein amino acid sequences conditioned on control tags.

UniRep: Developed at Harvard University, UniRep is a deep learning model that encodes protein sequences into fixed-length vectors that can be used for various downstream tasks, such as protein function prediction and protein-protein interaction prediction.

These models are constantly being improved and new models are being developed, so it is worth keeping up to date with the latest research in the field.

3. Data source and methods

3.1. Data source

Predicting SARS-CoV-2 mutations is a complex task that requires expertise in both bioinformatics and deep learning. LSTM encoder-decoder models have been used in many natural language processing tasks, but they can also be applied to sequence prediction tasks, such as SARS-CoV-2 mutation prediction. The dataset for our evaluation was collected from the National Center for Biotechnology Information (NCBI) (National Center for Biotechnology Information (NCBI), Bethesda (MD), 1988), the world’s largest dataset repository for genomic sequences. A total of 250 SARS-CoV2 variants were considered (Sah, Surendiran, & Dhanalakshmi, Citation2023). The information gathered pertains to all protein sequences, and the dataset is in FASTA format. The experimental setup for predicting the mutation rate of sequences is shown in Table 1.

Table 1. Experimental setup for the proposed model.

Table 1 contains information about the experimental requirements for the proposed work.

An LSTM-based model can be trained on a large dataset of protein sequences and their corresponding mutation information to learn patterns and relationships between the sequences and mutations. The model can then be used to predict mutations in new protein sequences based on those patterns and relationships. In the case of SARS-CoV-2, an LSTM-based model can be trained on a dataset of protein sequences from different strains of the virus and their associated mutation information. The model can learn how different mutations affect the structure and function of the viral proteins and use that knowledge to predict the effects of new mutations. The input to the model is a sequence of amino acids that make up the protein, and the output is the predicted mutation(s) and their effects on the protein. The model uses the previous state of the LSTM to encode the sequence and then decodes it to generate the prediction. The LSTM model can be a powerful tool for predicting mutations in protein sequences, but it is important to note that the accuracy of the predictions depends on the quality and size of the training dataset, as well as the features and parameters of the model itself.

Amino acids are a group of 20 chemicals that make up proteins. Proteins consist of polypeptides, which are long chains of amino acids. The sequence of the amino acid chain causes the polypeptide to fold into a physiologically active form. Protein amino acid sequences are encoded in the genes (Smith, Citation2019; Sah, Surendiran, Dhanalakshmi, & Kamerkar, Citation2021). Annexure 1 lists the proteins considered for the experimentation, with the common amino acid sequences and the sequence length of each protein after alignment. The amino acid sequences are collected from NCBI (National Center for Biotechnology Information (NCBI), Bethesda (MD), 1988).

3.2. Proposed LSTM model

Long short-term memory (LSTM) is a variant of RNNs (Hochreiter & Schmidhuber, Citation1997) that can learn long-term dependencies and is specifically designed to avoid the vanishing-gradient problem that prevents plain RNNs from learning them. In the context of protein sequence prediction, LSTMs have been used to predict the secondary structure of proteins, as well as the binding affinity between proteins and ligands. LSTMs can also be used to predict the sequence of a protein from its genetic sequence, which is an important step in drug design and other bioinformatics applications. LSTMs work by passing information through a series of "gates" that control the flow of information through the network. These gates allow the LSTM to selectively remember or forget previous inputs, which enables it to maintain a long-term memory of the input sequence. The output of the LSTM is then used to make a prediction about the next item in the sequence. To train the LSTM model, the protein sequence data is converted to one-hot encoded format using the np.eye(n_classes) function, which creates an identity matrix with n_classes rows and columns. Each row of the identity matrix is the one-hot vector for one amino acid class, so indexing the matrix with the integer code of each residue yields the encoded sequence. The seq_data_one_hot variable is a 3D numpy array with shape (n_samples, max_seq_len, n_classes). The LSTM model is defined using the Keras Sequential model API, with one LSTM layer and one dense output layer. The model is compiled with the appropriate loss function and optimizer.
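The one-hot encoding step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the alphabet (20 standard amino acids plus a gap symbol used for padding), the helper names, and the example peptides are all hypothetical, and the `identity` helper stands in for `np.eye(n_classes)` so that the sketch has no dependencies.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues (assumed alphabet)
GAP = "-"                              # assumed padding / alignment gap symbol
ALPHABET = AMINO_ACIDS + GAP
CHAR_TO_INT = {ch: i for i, ch in enumerate(ALPHABET)}


def identity(n):
    """Plain-Python stand-in for np.eye(n): an n x n identity matrix."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def one_hot_encode(sequences, max_seq_len):
    """Return a nested list shaped (n_samples, max_seq_len, n_classes)."""
    eye = identity(len(ALPHABET))
    encoded = []
    for seq in sequences:
        # Truncate or pad every sequence to the same length.
        padded = seq[:max_seq_len].ljust(max_seq_len, GAP)
        # Row k of the identity matrix is the one-hot vector for class k.
        encoded.append([list(eye[CHAR_TO_INT[ch]]) for ch in padded])
    return encoded


seq_data_one_hot = one_hot_encode(["MFVF", "MKTIIA"], max_seq_len=6)
print(len(seq_data_one_hot), len(seq_data_one_hot[0]), len(seq_data_one_hot[0][0]))
```

With NumPy, the same lookup is commonly written as `np.eye(n_classes)[indices]`, where `indices` is an integer array of residue codes.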

The protein sequence data is split into training and validation sets, and the model is trained using the fit method of the Keras Sequential model API. Table 2 shows the flow and contribution of the proposed work.

Table 2. Contribution summary.

Protein sequences can be fed into an LSTM model for training using Python and the Keras deep learning library. An LSTM unit has a cell/node, an input gate, an output gate, and a forget gate at its base. The node retains values over particular time intervals, while the input and output gates control the information flow (Koumakis, Citation2020). A Long Short-Term Memory (LSTM) model is proposed to predict the amino acid sequences, or protein sequences, of the virus. The proposed work consists of a few steps. In the initial step, the protein sequence is preprocessed. Before alignment, 149 sequences were considered, each with a length between 5k and 6k. The sequence lengths considered after alignment are shown in Table 3.

Table 3. Total length of protein sequences before and after alignment.
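The gate mechanism of an LSTM unit described above can be illustrated with a single forward time step. The sketch below follows the standard LSTM formulation (forget, input, and output gates plus a candidate cell update) in plain Python; it is not the Keras implementation used in the study, and the parameter layout, names, and toy values are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell.

    params maps a gate name to (W, U, b): W is hidden x input,
    U is hidden x hidden, and b has one bias per hidden unit.
    """
    def affine(name):
        W, U, b = params[name]
        return [
            sum(w * xi for w, xi in zip(W[k], x))
            + sum(u * hi for u, hi in zip(U[k], h_prev))
            + b[k]
            for k in range(len(b))
        ]

    f = [sigmoid(v) for v in affine("forget")]       # how much old memory to keep
    i = [sigmoid(v) for v in affine("input")]        # how much new content to write
    o = [sigmoid(v) for v in affine("output")]       # how much memory to expose
    g = [math.tanh(v) for v in affine("candidate")]  # proposed new content

    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy setup (assumed sizes): 3 input features, 2 hidden units, all weights zero.
n_in, hidden = 3, 2
zeros = (
    [[0.0] * n_in for _ in range(hidden)],
    [[0.0] * hidden for _ in range(hidden)],
    [0.0] * hidden,
)
params = {name: zeros for name in ("forget", "input", "output", "candidate")}
h, c = lstm_cell_step([1.0, 0.0, 0.0], [0.0, 0.0], [2.0, -4.0], params)
print(h, c)
```

With all weights set to zero every gate evaluates to 0.5, so the cell state is simply halved; trained weights would instead let the gates open or close depending on the input residue.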

Next, the same-length input sequences are converted to a one-hot encoded representation and fed to the LSTM model. Each LSTM cell in the encoder combines the one-hot encoded vector with the hidden-state and cell-state vectors. In the third phase, the encoder outputs its cell-state and hidden-state vectors. In the decoder, with the exception of the first cell, which receives its cell and hidden states directly from the encoder, each cell derives its cell and hidden states from the one before it. A dense layer then predicts the probability distribution of the amino acid at position t in the succeeding generation sequence.

A phylogenetic tree is a branching diagram that represents the evolutionary relationships among a set of organisms or sequences, based on the similarities and differences in their genetic or protein sequences. In the case of SARS-CoV-2, a phylogenetic tree can be constructed using the amino acid sequences of the virus. The tree can help to visualize the evolutionary history of SARS-CoV-2, and can be used to identify the origin of the virus, its transmission patterns, and the emergence of new variants. The tree is typically constructed using bioinformatics software that can align the amino acid sequences, calculate the genetic distances between them, and infer the branching patterns. One common software package used to construct phylogenetic trees is MEGA (Molecular Evolutionary Genetics Analysis), which can handle large datasets and provides various options for phylogenetic analysis, including maximum likelihood, neighbor-joining, and Bayesian inference. Other software tools used for phylogenetic analysis include RAxML, PhyML, and BEAST. The resulting tree can be visualized using software such as FigTree or iTOL (Interactive Tree of Life), which allows for further customization and annotation of the tree. The phylogenetic tree can provide valuable insights into the evolution and diversity of SARS-CoV-2, and can inform public health measures and vaccine development strategies.
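The "calculate the genetic distances" step that precedes tree building can be sketched with the p-distance, the proportion of aligned sites at which two sequences differ. This is one of the simplest distance measures and is used here only as an illustration; the source does not state which distance was used, and the function names and example sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(aligned):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q]) for p in names for q in names}


# Hypothetical aligned fragments, not taken from the study's dataset.
aligned = {
    "ref":  "MFVFLVLLPL",
    "var1": "MFVFLVLLPL",
    "var2": "MFVYLVLLPL",
    "var3": "MFVYLVL-PI",
}
dist = distance_matrix(aligned)
for (p, q), d in dist.items():
    if p < q:
        print(p, q, round(d, 3))
```

A distance-based method such as neighbor-joining would take a matrix like `dist` as input and repeatedly join the closest pair of sequences or clusters to infer the branching pattern.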

Workflow:

  • Step 1: Import necessary Libraries

  • Step 2: Load Amino Acid Sequences

  • Step 3: For each sequence, map unique chars to integer by creating dictionary

  • Step 4: Prepare the genomic dataset of input to output pairs encoded as integer

  • Step 5: Reshape the data

  • Step 6: Apply one hot encoding

  • Step 7: Define LSTM model

  • Step 8: Define the checkpoint

  • Step 9: Fit the model

  • Step 10: Load the network weights

  • Step 11: Compute Accuracy
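Steps 3 and 4 of the workflow can be sketched as follows. The source does not specify how the input-to-output pairs are formed, so this sketch assumes a sliding window in which a fixed number of residues is used to predict the next one; the function names, the window size, and the example sequence are hypothetical.

```python
def build_vocab(sequences):
    """Step 3: map each unique character to an integer."""
    chars = sorted(set("".join(sequences)))
    return {ch: i for i, ch in enumerate(chars)}


def make_pairs(seq, char_to_int, window):
    """Step 4: integer-encoded (input window, next residue) pairs."""
    inputs, targets = [], []
    for start in range(len(seq) - window):
        chunk = seq[start:start + window]
        inputs.append([char_to_int[ch] for ch in chunk])
        targets.append(char_to_int[seq[start + window]])
    return inputs, targets


sequences = ["MFVFLV"]              # hypothetical fragment
char_to_int = build_vocab(sequences)
X, y = make_pairs(sequences[0], char_to_int, window=3)
print(char_to_int)
print(X, y)
```

The integer windows in `X` would then be reshaped and one-hot encoded (Steps 5 and 6) before being passed to the LSTM model.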

Figure 1 shows the workflow of the proposed work for predicting the mutations in the protein sequence.

Figure 1. Proposed model for predicting the sequences.

The mutations in SARS-CoV-2 can have a significant impact on pathogenicity, diagnostics, therapeutics, and vaccines. Here are some of the ways in which mutations can impact each of these areas (Centers for Disease Control & Prevention, 2021; World Health Organization, Citation2021; Korber et al., Citation2020; Lauring & Hodcroft, Citation2021):

  1. Pathogenicity: Mutations in SARS-CoV-2 can affect how the virus interacts with the host cells, leading to changes in the severity of the disease. For example, some mutations have been associated with increased transmission and more severe disease, while others have been associated with decreased virulence. Mutations in the spike protein can affect the virus’s ability to bind to the ACE2 receptor on host cells, which is a key step in the viral infection process.

  2. Diagnostics: Mutations in SARS-CoV-2 can impact diagnostic tests, particularly those that rely on detecting viral RNA. For example, some mutations can cause false negative results in PCR tests, which can lead to incorrect diagnoses and potentially contribute to the spread of the virus. New strains of the virus that carry multiple mutations may require updates to existing diagnostic tests to ensure their accuracy.

  3. Therapeutics: Mutations in SARS-CoV-2 can impact the effectiveness of therapeutic treatments. For example, some mutations in the spike protein can affect the binding of neutralizing antibodies, making certain treatments less effective. The emergence of new strains of the virus can also impact the effectiveness of existing treatments and require the development of new therapies.

  4. Vaccines: Mutations in SARS-CoV-2 can impact the effectiveness of vaccines. For example, mutations in the spike protein can affect the ability of the immune system to recognize and neutralize the virus. If a mutation occurs in a region of the virus that is targeted by a vaccine, it can reduce the vaccine’s effectiveness. The emergence of new strains of the virus may require the development of updated or new vaccines to ensure their effectiveness.

While compounds derived from complementary and alternative medicinal plants, such as 6-shogaol, have been shown to have potential therapeutic properties, their effectiveness as a treatment for an evolving virus like SARS-CoV-2 is uncertain, and they should not be used as a replacement for conventional treatments. 6-shogaol is a natural compound found in ginger and has been studied for its potential medicinal properties, including antiviral activity. Some studies have suggested that 6-shogaol may have activity against various viruses, including influenza, herpes simplex virus, and respiratory syncytial virus. However, there is currently no clinical evidence to support the use of 6-shogaol as a treatment for COVID-19, and more research is needed to determine its safety and effectiveness.

It is important to note that conventional treatments, such as vaccines, antiviral drugs, and supportive care, have undergone rigorous testing and have been shown to be effective in treating COVID-19. While complementary and alternative medicinal plants may have potential benefits, they should be used in combination with, and not as a replacement for, conventional treatments.

4. Results & discussion

To find the mutations, our work begins with an alignment of protein sequences using a bioinformatics tool, BioEdit. BioEdit is a biological sequence editor that runs on Windows and provides fundamental functions for editing, aligning, manipulating, and analysing protein and nucleic acid sequences. Although it lacks the capacity of more robust sequence analysis applications, BioEdit provides a number of quick and simple functions for annotating, editing, and manipulating sequences. Genomic sequence alignment is a way of arranging protein or DNA sequences to identify similar regions, which may reflect evolutionary relationships between the sequences (Hall, Citation2004). Alignment of sequences has been performed to find the point mutations. Figure 2 shows the alignment of replicase protein sequences; multiple replicase proteins common to different SARS-CoV2 variants have been considered. The tool aligns the sequences and saves the aligned sequences in FASTA format. Figure 3 shows the phylogenetic, or evolution, tree of replicase proteins, which estimates the relationships among the sequences. This estimation can help in designing vaccines against them. It might also lead to new treatment options and a better understanding of the progression of the virus.
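Once two sequences are aligned, point mutations can be read off column by column. The sketch below is a minimal illustration of that idea, not the BioEdit procedure used in the study: it reports substitutions in the usual notation (reference residue, position in the reference, variant residue, as in D614G) and computes percent identity over ungapped columns. The function names and the short example sequences are hypothetical.

```python
def find_point_mutations(reference, variant):
    """List substitutions between two aligned sequences, e.g. 'D6G'.

    Positions are 1-based and counted along the ungapped reference.
    Columns with a gap in either sequence are treated as indels and skipped.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    ref_pos = 0
    for r, v in zip(reference, variant):
        if r != "-":
            ref_pos += 1
        if r == "-" or v == "-":
            continue
        if r != v:
            mutations.append(f"{r}{ref_pos}{v}")
    return mutations


def percent_identity(reference, variant):
    """Percentage of ungapped aligned columns with identical residues."""
    pairs = [(r, v) for r, v in zip(reference, variant) if r != "-" and v != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for r, v in pairs if r == v)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments: one insertion in the variant, one substitution.
ref = "MFV-FLDVL"
var = "MFVAFLGVL"
print(find_point_mutations(ref, var))
print(percent_identity(ref, var))
```

The same column-wise comparison, applied to a predicted sequence and the true next-generation sequence, gives a per-residue accuracy figure.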

Figure 2. Alignment of replicase proteins.

Figure 3. Phylogenetic tree of replicase proteins.

The study of the link between biological lineages that have a common ancestor is known as phylogeny. To infer phylogeny, the differences between aligned sequences of genomes and proteins are measured and presented in the form of a tree, with modern species, intermediates, and common ancestors occupying the terminal nodes, internal nodes, and root, respectively. The tree’s topology, branch length, shape, and root position are distinct features (Gorbalenya & Lauber, Citation2017).

Figure 4 shows the alignment of putative spike glycoprotein sequences; multiple glycoproteins that are common to the different SARS-CoV2 variants were considered. The tool aligns the sequences and saves the aligned sequences in FASTA format. Figure 5 shows the phylogenetic (evolutionary) tree of the spike glycoproteins.

Figure 4. Alignment of putative spike GlycoProteins.

Figure 5. Phylogenetic tree of putative spike glycoproteins.

Figure 6 shows the alignment of nucleocapsid protein sequences; multiple nucleocapsid proteins that are common to the different SARS-CoV2 variants were considered. The tool aligns the sequences and saves the aligned sequences in FASTA format. Figure 7 shows the phylogenetic (evolutionary) tree of the nucleocapsid proteins.

Figure 6. Alignment of nucleocapsid proteins.

Figure 7. Phylogenetic tree of nucleocapsid proteins.

Figure 8 shows the alignment of spike protein sequences; multiple spike proteins that are common to the different SARS-CoV2 variants were considered. The tool aligns the sequences and saves the aligned sequences in FASTA format. Figure 9 shows the phylogenetic (evolutionary) tree of the spike proteins.

Figure 8. Alignment of spike proteins.

Figure 9. Phylogenetic tree of spike proteins.

Figure 10 shows the alignment of ORF1a protein sequences; multiple ORF1a proteins that are common to the different SARS-CoV2 variants were considered. The tool aligns the sequences and saves the aligned sequences in FASTA format. Figure 11 shows the phylogenetic (evolutionary) tree of the ORF1a proteins.

Figure 10. Alignment of ORF1a proteins.

Figure 11. Phylogenetic tree of ORF1a proteins.

After aligning the protein sequences and generating their evolutionary trees, the next step is the proposed Seq-2-Seq LSTM-based encoder-decoder model. This deep learning approach is used to predict the amino acid sequences and future mutations. Table 4 shows the training of the various protein sequences to predict future sequence mutations, including the time taken for each step and the associated loss. The model is trained for 50 epochs with a batch size of 10, using the Adam optimizer with a learning rate of 0.001, 100 hidden neurons, and a dropout value of 0.5. The proposed model is trained in Colab, and accuracy is the metric used to assess model performance.
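The training setup described above could look roughly like the following Python sketch, assuming TensorFlow/Keras. The text does not give the layer layout or the input encoding, so several parts are assumptions and not the authors' implementation: the shared embedding layer, the start/end markers, the sparse categorical cross-entropy loss, and the helper names `encode`, `make_example`, and `build_model`. Only the hyperparameter values are taken from the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"   # 20 residues plus the alignment gap
START, END = "<", ">"                   # hypothetical start/end markers
VOCAB = {ch: i for i, ch in enumerate(AMINO_ACIDS + START + END)}

# Hyperparameters as reported in the text
HPARAMS = {"epochs": 50, "batch_size": 10, "hidden_units": 100,
           "dropout": 0.5, "learning_rate": 0.001}


def encode(seq):
    """Map a sequence to integer token ids (raises KeyError on unknown symbols)."""
    return [VOCAB[ch] for ch in seq]


def make_example(source, target):
    """Build (encoder input, decoder input, decoder target) for teacher forcing."""
    enc_in = encode(source)
    dec_in = encode(START + target)    # decoder sees the target shifted right
    dec_out = encode(target + END)     # and predicts the next token at each step
    return enc_in, dec_in, dec_out


def build_model(vocab_size=len(VOCAB), hp=HPARAMS):
    """Encoder-decoder LSTM; requires TensorFlow, imported only when called."""
    from tensorflow import keras
    from tensorflow.keras import layers

    enc_tokens = keras.Input(shape=(None,), dtype="int32")
    dec_tokens = keras.Input(shape=(None,), dtype="int32")
    embed = layers.Embedding(vocab_size, hp["hidden_units"])  # assumed, shared

    # Encoder: keep only the final hidden and cell states
    _, state_h, state_c = layers.LSTM(
        hp["hidden_units"], return_state=True, dropout=hp["dropout"]
    )(embed(enc_tokens))

    # Decoder: initialised with the encoder states, emits one token per step
    dec_seq = layers.LSTM(
        hp["hidden_units"], return_sequences=True, dropout=hp["dropout"]
    )(embed(dec_tokens), initial_state=[state_h, state_c])
    probs = layers.Dense(vocab_size, activation="softmax")(dec_seq)

    model = keras.Model([enc_tokens, dec_tokens], probs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=hp["learning_rate"]),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Toy pair of aligned sequences (made up) to show the teacher-forcing shift
enc_in, dec_in, dec_out = make_example("MFVFL", "MFIFL")
print(dec_in[1:] == dec_out[:-1])
```

In such a setup, the integer sequences would be padded to a common length and passed to `model.fit([enc, dec_in], dec_out, epochs=50, batch_size=10)`.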

Table 4. Training of various protein sequences [sample].

The similarity between the various amino acid (protein) sequences is reported below. The analysis shows that the sequences are highly similar to one another, with very few mutations; this result is specific to the dataset used. Table 5 shows the average pairwise similarity percentage between the amino acid sequences of SARS-CoV2.

Table 5. Average similarity percentage of amino acid sequences (pairwise).

Finally, the predicted amino acid sequences are approximately 98% similar to the amino acid sequences used for training, and the mutations observed were negligible because of this high similarity. The accuracy metric measures the proportion of correct predictions made by the trained deep learning model.
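The two quantities discussed above can be illustrated with a short Python sketch. It assumes that similarity is measured as percent identity over aligned columns and that accuracy is the fraction of positions predicted correctly; the text does not spell out either definition, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of aligned columns where the two sequences agree.

    Columns where both sequences carry a gap count as agreement.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def average_pairwise_similarity(sequences):
    """Mean percent identity over all unordered pairs of sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    scores = [percent_identity(a, b) for a, b in combinations(sequences, 2)]
    return sum(scores) / len(scores)


def prediction_accuracy(predicted, expected):
    """Fraction of positions where the predicted residue equals the expected one."""
    return percent_identity(predicted, expected) / 100.0


# Toy aligned sequences; residues are made up for illustration.
toy = ["MFVFLVLLPL", "MFVFLVLLPV", "MFIFLVLLPL"]
print(round(average_pairwise_similarity(toy), 2))
print(prediction_accuracy("MFVFLVLLPV", "MFVFLVLLPL"))
```

Under these definitions, a predicted sequence that differs from the expected one at 2 of every 100 positions has an accuracy of 0.98.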

5. Conclusion

In summary, mutations in SARS-CoV-2 can have significant impacts on pathogenicity, diagnostics, therapeutics, and vaccines. It is important for researchers and public health officials to monitor the evolution of the virus and its mutations to ensure that diagnostic tests, treatments, and vaccines remain effective. Deep learning algorithms play a significant role in bioinformatics and can perform tasks such as sequence categorization and prediction in a short amount of time. Our research focused on predicting mutations and computing similarities between protein sequences. The proposed LSTM model obtained an average prediction accuracy of 98% for all protein sequences included in the study, that is, spike, replicase, putative, ORF1a, nucleocapsid, and polyprotein. Such predictions are beneficial in developing drugs for specific altered protein sequences. Finally, this study has shown that predicting future SARS-CoV2 sequences is feasible. Bioinformatics tools and a deep learning-based technique were used to examine and visualize amino acid similarities, mutation prediction, and phylogenetic analysis. Despite this progress, challenges remain, such as taking the global information of proteins into account and identifying the changes when predicting mutations.

Authors’ contributions

All authors contributed equally to completing the manuscript.

Availability of data and materials

All data are available from the authors and will be supplied on request from the reviewer.

Disclosure statement

The authors declare no conflict of interest.

References

  • Carvalho, P. C., Fischer, J. S., & Chen, E. I. (2009). DomProtein explorer: A tool for exploring domain-domain interactions in protein structures. Bioinformatics, 25(9), 1235–1236.
  • Centers for Disease Control and Prevention (2021). Emerging SARS-CoV-2 variants. https://www.cdc.gov/coronavirus/2019-ncov/more/science-and-research/scientific-brief-emerging-variants.html.
  • Chen, J., Gao, K., Wang, R., & Wei, G.-W. (2021). Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies. Chemical Science, 12(20), 6929–6948. doi:10.1039/d1sc01203g
  • Edgar, R. C. (2004). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797. doi:10.1093/nar/gkh340
  • Gorbalenya, A. E., & Lauber, C. (2017). Phylogeny of viruses. In Reference module in biomedical sciences.
  • Hall, T. (1999). BioEdit: A user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symposium Series, 41, 95–98.
  • Hall, T. (2004). BioEdit version 7.0.0. Distributed by the author, website: www.mbio.ncsu.edu/BioEdit/bioedit.Html.
  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
  • Korber, B., Fischer, W. M., Gnanakaran, S., Yoon, H., Theiler, J., Abfalterer, W., … Bhattacharya, T. (2020). Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. bioRxiv. https://www.biorxiv.org/content/10.1101/2020.04.29.069054v2.
  • Koumakis, L. (2020). Deep learning models in genomics. Computational and Structural Biotechnology Journal, 18, 1466–1473. doi:10.1016/j.csbj.2020.06.017
  • Kumar, S., Stecher, G., & Tamura, K. (2016). MEGA7: Molecular evolutionary genetics analysis version 7.0 for bigger datasets. Molecular Biology and Evolution, 33(7), 1870–1874. doi:10.1093/molbev/msw054
  • Lauring, A. S., & Hodcroft, E. B. (2021). Genetic variants of SARS-CoV-2-what do they mean? JAMA, 325(6), 529–531. doi:10.1001/jama.2020.27124
  • Lopez-Rincon, A., Tonda, A., Mendoza-Maldonado, L., Mulders, D. G. J. C., Molenkamp, R., Perez-Romero, C. A., … Kraneveld, A. D. (2021). Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning. Scientific Reports, 11(1), 1–11.
  • Lu, R., Zhao, X., Li, J., Niu, P., Yang, B., Wu, H., … Bi, Y. (2020). Genomic characterisation and epidemiology of 2019 novel coronavirus: Implications for virus origins and receptor binding. Lancet, 395(10224), 565–574.
  • Mohamed, T., Sayed, S., Salah, A., & Houssein, E. H. (2021). Long short-term memory neural networks for RNA viruses mutations prediction. Mathematical Problems in Engineering, Article ID 9980347, 9. doi:10.1155/2021/9980347
  • National Center for Biotechnology Information (NCBI) Bethesda (MD). (1988). National Library of Medicine (US), National Center for Biotechnology Information; https://www.ncbi.nlm.nih.gov/. Accessed 30 January 2022.
  • Nawaz, M. S., Fournier-Viger, P., Shojaee, A., & Fujita, H. (2021). Using artificial intelligence techniques for COVID-19 genome analysis. Applied Intelligence (Dordrecht, Netherlands), 51(5), 3086–3103. doi:10.1007/s10489-021-02193-w
  • Nguyen, T. T., Pathirana, P. N., Nguyen, T., Nguyen, Q. V. H., Bhatti, A., Nguyen, D. C., … Abdelrazek, M. (2021). Genomic mutations and changes in protein secondary structure and solvent accessibility of SARS-CoV-2 (COVID-19 virus). Scientific Reports, 11(1), 1–16.
  • Pathan, R. K., Biswas, M., & Khandaker, M. U. (2020). Time series prediction of COVID-19 by mutation rate analysis using recurrent neural network-based LSTM model. Chaos, Solitons, and Fractals, 138, 110018. doi:10.1016/j.chaos.2020.110018
  • Rambaut, A., Holmes, E. C., O'Toole, Á., Hill, V., McCrone, J. T., Ruis, C., … Pybus, O. G. (2020). A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology, 5(11), 1403–1407. doi:10.1038/s41564-020-0770-5
  • Sah, S., Surendiran, B., Dhanalakshmi, R., & Kamerkar, A. (2021). Classification and alignment of SARS-CoV2 sequences using machine learning approach. International Journal of Advanced Research in Management, Architecture, Technology and Engineering, 7, 34–44.
  • Sah, S., Surendiran, B., & Dhanalakshmi, R. (2023). Genomic sequence similarity of SARS-CoV2 nucleotide sequences using biopython: Key for finding cure and vaccines. In Application of deep learning methods in healthcare and medical science (pp. 211–223). USA: Apple Academic Press.
  • Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology, 26(10), 1135–1145. doi:10.1038/nbt1486
  • Sievers, F., & Higgins, D. G. (2018). Clustal Omega for making accurate alignments of many protein sequences. Protein Science: A Publication of the Protein Society, 27(1), 135–145. doi:10.1002/pro.3290
  • Smith, Y. (2019). Amino acids and protein sequences news. https://www.news-medical.net/life-sciences/Amino-Acids-and-Protein-Sequences.aspx. Accessed 26 Feb 2019
  • Stranger, B. E., & Dermitzakis, E. T. (2006). From DNA to RNA to disease and back: The ‘central dogma’ of regulatory disease variation. Human Genomics, 2(6), 383–390.
  • Taly, J. F., Magis, C., Bussotti, G., Chang, J. M., Di Tommaso, P., Erb, I., … Notredame, C. (2011). The coffee served blind: A new view on the multiple sequence alignment problem. PLoS One, 6(12), e28817. doi:10.1371/journal.pone.0028817
  • Tomita, N., Mori, H., & Mochizuki, T. (2015). An efficient way of selecting multiple sequences for BioEdit. Bioscience, Biotechnology, and Biochemistry, 79(12), 2013–2015.
  • Wang, R., Chen, J., Gao, K., & Wei, G.-W. (2021). Vaccine-escape and fast-growing mutations in the United Kingdom, the United States, Singapore, Spain, India, and other COVID-19-devastated countries. Genomics, 113(4), 2158–2170. doi:10.1016/j.ygeno.2021.05.006
  • Whata, A., & Chimedza, C. (2021). Deep learning for SARS COV-2 genome sequences. IEEE Access: Practical Innovations, Open Solutions, 9, 59597–59611. doi:10.1109/ACCESS.2021.3073728
  • World Health Organization (2021). Tracking SARS-CoV-2 variants. https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/.
  • Xu, J., Guo, H. C., Wei, Y. Q., Shu, L., Wang, J., Li, J. S., … Sun, S. Q. (2015). Phylogenetic analysis of canine parvovirus isolates from Sichuan and Gansu provinces of China in 2011. Transboundary and Emerging Diseases, 62, 91–95.
  • Yan, S., & Wu, G. (2021). Neural network to predict probabilistically possible mutations in hemagglutinins from Eurasia H1 influenza A virus. In 2nd International Conference on Computer Vision, Image, and Deep Learning, vol. 11911, pp. 283–289. SPIE.
  • Zhou, B., Zhou, H., Zhang, X., Xu, X., Chai, Y., Zheng, Z., … Zhou, Z. (2023). TEMPO: A transformer-based mutation prediction framework for SARS-CoV-2 evolution. Computers in Biology and Medicine, 152, 12–21.

Annexure 1

Table A1. Detailed description of protein sequence data.