Research Paper

MRM-BERT: a novel deep neural network predictor of multiple RNA modifications by fusing BERT representation and sequence features

Pages 1-10 | Accepted 02 Feb 2024, Published online: 15 Feb 2024

ABSTRACT

RNA modifications play crucial roles in various biological processes and diseases. Accurate prediction of RNA modification sites is essential for understanding their functions. In this study, we propose a hybrid approach that fuses a pre-trained sequence representation with various sequence features to predict multiple types of RNA modifications in one combined prediction framework. We developed MRM-BERT, a deep learning method that combines the pre-trained DNABERT deep sequence representation module with a convolutional neural network (CNN) exploiting four traditional sequence feature encodings to improve the prediction performance. MRM-BERT was evaluated on multiple datasets of 12 commonly occurring RNA modifications, including m6A, m5C and m1A. The results demonstrate that our hybrid model outperforms other models in terms of the area under the receiver operating characteristic curve (AUC) for all 12 types of RNA modifications. MRM-BERT is available as an online tool (http://117.122.208.21:8501) or as source code (https://github.com/abhhba999/MRM-BERT), allowing users to predict RNA modification sites and visualize the results. Overall, our study provides an effective and efficient approach to predict multiple RNA modifications, contributing to the understanding of RNA biology and the development of therapeutic strategies.

Background

Chemical modification is a highly specific and effective way to regulate the function of biomolecules. Almost all types of RNA undergo chemical modification. At least 160 types of chemical modifications have been found in RNA [Citation1], which play important roles in biological processes such as transcription, splicing, translation and gene silencing, affecting the structure, function and interaction of RNA [Citation2]. Therefore, RNA modifications are crucial for almost every aspect of the RNA life cycle and various cellular and developmental processes. RNA modifications also play important roles in diseases. For example, the m6A writer (the catalysing enzyme of this RNA modification) methyltransferase-like 3 (METTL3) can promote the proliferation of acute myeloid leukaemia (AML) cells [Citation3], and the deficiency of pseudouridine synthase 10 (PUS10), a writer of pseudouridine modification, can protect prostate cancer cells from tumour necrosis factor-related apoptosis-inducing ligand (TRAIL)-induced apoptosis [Citation4].

The target and position of RNA modifications are of the utmost importance in determining the effects of RNA modifications, so it is particularly important to accurately and efficiently identify the RNA modification sites [Citation5]. Recently developed high-throughput sequencing techniques for detecting RNA modification sites have greatly accelerated the functional study of RNA modifications. Taking m6A methylation as an example, there are various sequencing approaches for m6A identification including antibody-dependent methods like MeRIP-seq [Citation6] and miCLIP [Citation7] and antibody-independent methods like m6A-REF-seq [Citation8], MAZTER-seq [Citation9] and DART-seq [Citation10]. However, these experimental methods are expensive, labour-intensive, and highly sensitive to the experimental conditions. For example, MeRIP-seq requires a large amount of RNA input and cannot map m6A sites at single-base resolution, miCLIP requires special antibodies and sophisticated enzyme crosslinking procedures, m6A-REF-seq and MAZTER-seq can cover only a subset of modification sites, while DART-seq is also complex to conduct. Therefore, there is an urgent need for computational methods that can predict RNA modification sites based on multimodal sequencing data generated by various conventional sequencing methods.

Several algorithms have been proposed to predict RNA modification sites based on RNA sequence features, such as SRAMP [Citation11], DL-M6A [Citation12], WHISTLE [Citation13], Gene2vec [Citation14], MTDeepM6A-2S [Citation15], iRNA-m5U [Citation16], m5C-HPromoter [Citation17] and pRNAm-PC [Citation18]. These predictors extract various sequence features and utilize different machine learning or deep learning frameworks to predict RNA modification sites with improving accuracy. However, these methods also have limitations: most previous predictors focus on only one specific type of modification, their datasets are mostly from a single experimental method, and their sequence representations are not fully comprehensive. To this end, MultiRM [Citation19] assembled datasets of 12 common modifications from multiple experiments and used the word2vec [Citation20] embedding method combined with a long short-term memory (LSTM) attention network to predict RNA modification sites, achieving simultaneous prediction of multiple types of modifications with competitive performance. On the other hand, we note that the sequence representation could be enriched by diverse sequence embeddings and encodings, while the predictor could also be further calibrated with more sizable test samples.

In this study, we propose a novel method for predicting multiple types of RNA modification sites by combining pre-trained models and various feature encodings. The term pre-trained model here refers specifically to deep neural networks that are pre-trained on large-scale unlabelled data. Such models can take full advantage of the large amount of unlabelled data and learn general features and representations of sequences, thereby alleviating the limited availability and quality of labelled data, which is often the case for RNA modifications. Pre-trained models have achieved great success in the field of natural language processing (NLP) [Citation21]. For example, bidirectional encoder representations from transformers (BERT) [Citation22], GPT-3 [Citation23] and other models have reached state-of-the-art performance in various NLP tasks. Similarly, DNA or RNA sequences can also be considered a kind of language, with certain grammatical and semantic rules. Therefore, deep features and representations of DNA or RNA sequences can be learned through pre-trained models and then transferred to downstream prediction tasks to improve the accuracy and efficiency of prediction. A typical representative of pre-trained models is DNABERT [Citation24]. As its name implies, DNABERT treats DNA sequences as a kind of text that is similar to human language, and uses two tasks, i.e. masked language model (MLM) and next sentence prediction (NSP), to learn the semantic representation of DNA sequences. DNABERT achieves better results than traditional deep learning models in downstream tasks such as genome fragment classification, genome fragment similarity calculation and genome fragment clustering. Nonetheless, RNA sequences carry structural and functional information that is not fully represented by their DNA counterparts. As a result, DNABERT, although highly comprehensive, is not by itself well suited for RNA modification site prediction. To overcome this limitation, we devised a deep learning framework, MRM-BERT, that fine-tunes the pre-trained DNABERT model and enhances prediction performance by integrating multiple sequence features through a convolutional neural network. We used this method to predict 12 common types of RNA modification sites from multiple datasets. In the following sections, we first introduce the establishment of MRM-BERT, and then demonstrate the performance and usage of this predictor.

Methods and materials

Dataset

The RNA modification site data were derived from the MultiRM dataset [Citation19], which consisted of datasets generated by 20 different base-resolution techniques covering 12 different types of RNA modifications (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, AtoI, Am, Cm, Gm and Um). It is noteworthy that the MultiRM dataset had already discarded modification sites identified by ambiguous techniques such as ordinary MeRIP-seq methylation peak calling combined with motif search. As for the negative samples (non-modification sites), they were randomly sampled from the non-modified bases of the same transcript containing the positive sites [Citation19]. In total, the whole dataset contains more than 300,000 sites.

We adopted two strategies for dataset partition. First, to facilitate direct comparison with MultiRM, we used their data partition strategy as ‘MultiRM’s fixed validation & testing’. We divided the dataset of each RNA modification type into three groups, i.e. training set, validation set and test set, and required that both the positive and negative test samples used here were exactly the same as those used by MultiRM. Second, we noted that the proportions of the validation and test sets in the original partition of MultiRM were small, and the model could be better calibrated and evaluated with more testing samples. Therefore, we also used a ‘K-fold cross-validation & testing’ data partition strategy with a training-validation-test setting, i.e. randomly taking 10% of all samples as the test set and then dividing the remaining 90% of the data evenly into five parts. Each time, a different four parts were used for training and the remaining part for validation, repeated five times.
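As an illustration of this partition scheme, the following scikit-learn sketch holds out a fixed 10% test set and rotates five validation folds over the remainder; the arrays `X` and `y` are placeholders for the encoded samples and their labels, not the actual MRM-BERT data pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Placeholder arrays standing in for the encoded sequence windows and labels.
X = np.random.rand(1000, 101)
y = np.random.randint(0, 2, size=1000)

# Hold out 10% of all samples as the fixed, independent test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# Divide the remaining 90% into five folds: in each round, four folds are used
# for training and the remaining fold for validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_rest, y_rest)):
    X_train, y_train = X_rest[train_idx], y_rest[train_idx]
    X_val, y_val = X_rest[val_idx], y_rest[val_idx]
    # ... train on (X_train, y_train), tune on (X_val, y_val) ...
```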

To address the issue of overlapping/redundant sequence windows, we performed CD-HIT-EST on each modification dataset before model training and testing, removing sequences with a similarity of no less than 0.9. More specifically, for the K-fold cross-validation and testing, the intra-dataset similarities of both the training and test sets are <0.9, and the inter-dataset similarity between the training and test sets is also <0.9. As for MultiRM’s fixed validation and testing, to ensure a fair comparison with MultiRM, we did not remove redundant sequences within the test set; however, the inter-dataset similarity between the training and test sets and the intra-dataset similarity of the training set were still reduced to <0.9 via the CD-HIT-EST de-redundancy procedure.
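This de-redundancy step can be scripted by invoking the CD-HIT suite, as in the minimal sketch below; the file names are placeholders, and the flags follow standard CD-HIT-EST usage (-c for the identity threshold, -n for the word size recommended at a 0.9 threshold).

```python
import subprocess

# Intra-dataset redundancy removal: cluster sequences at >=90% identity and
# keep one representative per cluster.
subprocess.run(["cd-hit-est", "-i", "train_raw.fa", "-o", "train_nr.fa",
                "-c", "0.9", "-n", "8"], check=True)

# Inter-dataset redundancy removal: cd-hit-est-2d retains only the sequences
# of -i2 (here, the test set) that do NOT match any -i sequence at >=90%.
subprocess.run(["cd-hit-est-2d", "-i", "train_nr.fa", "-i2", "test_raw.fa",
                "-o", "test_nr.fa", "-c", "0.9", "-n", "8"], check=True)
```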

In addition, to ensure a completely independent test set, we collected m6A, m5C, Psi and m1A sites from mouse data downloaded from m6A-Atlas [Citation25], m5C-Atlas [Citation26] and RMBase 3.0 [Citation27,Citation28]. Other modification types had insufficient data to constitute a test set. CD-HIT-EST-2D was applied to eliminate all mouse testing sequences similar to human training sequences, as described above. Data partitions and their respective purposes are shown in Supplementary Figure S1.

Selection of sequence length

To determine the optimal sequence length, we tested sequence lengths of 21, 31, 41, …, 121, 131, 141. The results indicate a steady increase in the average AUC from length 21 to 101, whereas lengths 121 to 141 showed no significant improvement in AUC compared to length 101 (Supplementary Figure S2). Therefore, we chose 101 as the final length of the sequence windows.

Traditional sequence feature encodings

For the RNA sequence of each sample, we retained 101 bases with the modification/non-modification site at the centre position. To establish an informative combination of traditional sequence feature encodings, different encodings were evaluated and selected by assessing the performance of XGBoost classifiers built on them. More specifically, the following sequence encodings were evaluated:

  1. PseKNC (Pseudo K-tuple Nucleotide Composition): This method extracts features by considering the local sequence patterns and the global sequence patterns of consecutive k nucleotides [Citation29].

  2. PseDNC (Pseudo Dinucleotide Composition): This method extracts features by considering the local sequence patterns and global sequence patterns of consecutive dinucleotides [Citation29].

  3. CKSNAP (Composition of K-Spaced Nucleic Acid Pairs): This method extracts features by considering k-spaced nucleotide pairs [Citation11,Citation30].

  4. KMER: This method extracts features by considering all possible combinations of k nucleotides [Citation31].

  5. DPCP (Dinucleotide Physicochemical Properties): This method extracts features by considering the physico-chemical properties of dinucleotides [Citation32].

  6. NCP (Nucleotide Chemical Property): This method extracts features by considering the chemical properties of nucleotides [Citation33].

  7. BINARY: This method extracts features by converting RNA sequences into binary codes [Citation34].

  8. ENAC (Enhanced Nucleic Acid Composition): This method extracts features by considering the composition and arrangement of nucleic acids [Citation30].

  9. EIIP (Electron–Ion Interaction Pseudopotential): This method extracts features by considering the pseudopotential of electron–ion interaction [Citation35].

  10. ANF (Accumulated Nucleotide Frequency): This method extracts features by considering the cumulative frequency of each nucleotide up to its position in the RNA sequence [Citation33].

Then, we performed prediction of each modification type one by one using the above sequence encodings, with XGBoost [Citation36] as the backend classifier. Additionally, we explored all possible combinations of these encodings, examining the relevance of these features. Finally, the NCP, EIIP, ENAC, Binary, CKSNAP and Kmer sequence encodings were chosen as the representative features, as shown in Supplementary Table S1.
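To make these encodings concrete, the sketch below implements four of them (NCP, EIIP, Binary and Kmer) for a 101-nt window; the property tables are the standard ones from the cited references, while the function names and toy window are our own illustrative choices rather than code from MRM-BERT.

```python
import itertools
import numpy as np

# NCP: (ring structure, functional group, hydrogen bonding) per nucleotide.
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}
# EIIP: electron-ion interaction pseudopotential value of each nucleotide.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}
# Binary: one-hot code of each nucleotide.
ONEHOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "U": (0, 0, 0, 1)}

def encode_ncp(seq):
    return np.array([NCP[b] for b in seq], dtype=float)      # shape (L, 3)

def encode_eiip(seq):
    return np.array([EIIP[b] for b in seq], dtype=float)     # shape (L,)

def encode_binary(seq):
    return np.array([ONEHOT[b] for b in seq], dtype=float)   # shape (L, 4)

def encode_kmer(seq, k=3):
    """Normalized frequencies of all 4^k possible k-mers in the window."""
    kmers = ["".join(p) for p in itertools.product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = len(seq) - k + 1
    return np.array([counts[km] / total for km in kmers])    # shape (4^k,)

window = "ACGU" * 25 + "A"   # toy 101-nt window; the centre base is the site
per_position = [encode_ncp(window), encode_binary(window),
                encode_eiip(window)[:, None]]                # position-wise maps
global_vec = encode_kmer(window)                             # window-level vector
```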

Deep learning framework

As shown in Figure 1, MRM-BERT was composed by integrating two deep neural network modules: one was the CNN module that took the selected traditional sequence encodings as input; the other was the BERT module that fine-tuned the DNABERT pre-trained model to fit the RNA modification prediction task.

Figure 1. The deep learning framework of MRM-BERT.


MRM-BERT integrated two deep neural network modules to achieve the prediction of multiple RNA modifications. First, the RNA sequences around the modification sites were represented by the selected traditional feature encodings, and then a CNN module based on these traditional feature encodings was trained. Second, the sequences around the modification sites were transformed into 3-mer vectors, and a BERT module was established by fine-tuning the DNABERT representation. Finally, the fully connected (FC) layers from the two deep neural network modules were fused to obtain the final classification prediction.

  1. CNN module: The four feature matrices (NCP, EIIP, ENAC, Binary) were input into the CNN module. Considering that a bidirectional LSTM would focus on the two ends of the input while the sequences near the centre position are of the utmost importance for RNA modification, a CNN is a better choice for integrating the various feature encodings. The inputs were first passed through two one-dimensional convolutional layers and max-pooled to reduce the parameter size, resulting in four feature maps. We concatenated the four feature maps with the other two feature matrices (CKSNAP and Kmer) and then passed the combined feature map through two further convolutional layers. For each feature matrix, a bundle of k 1D CNN filters, rather than a single 2D CNN filter, was applied along the sequence length (n) dimension. The outputs of these 1D CNN filters were finally integrated by a fully connected layer to obtain the CNN-based representation of RNA modification sites (see the sketch after this list).

  2. Fine-tuned BERT module: For each sequence, we first performed segmentation and masking, dividing the base sequence of length 101 into 99 k-mers of length 3 (i.e. 3-mers) using a sliding window of width 3. For the training set, we masked 15% of the k-mers and replaced them with the corresponding mask tokens. After preprocessing, we converted the obtained k-mers and masks into index embeddings from the token dictionary, and added position embeddings to obtain a fused embedding. The fused embedding was input into the pre-trained DNABERT model, composed of 12 transformer blocks, and then passed through a fully connected layer to fine-tune the DNABERT model and obtain the BERT-based representation of RNA modification sites.
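The following condensed PyTorch sketch shows how such a two-module fusion can be wired together. It is a simplified approximation, not the exact MRM-BERT configuration: the layer widths and kernel sizes are illustrative, the per-position encodings are stacked into one channel dimension for brevity, and the Hugging Face checkpoint name for the 3-mer DNABERT is an assumption.

```python
import torch
import torch.nn as nn
from transformers import BertModel

def to_3mers(seq):
    """Slide a width-3 window over a 101-nt sequence to get 99 overlapping 3-mers."""
    return " ".join(seq[i:i + 3] for i in range(len(seq) - 2))

class CNNModule(nn.Module):
    """Per-position encodings -> two Conv1d layers -> max-pooling -> FC representation."""
    def __init__(self, in_channels=12, hidden=64, seq_len=101, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),                      # halves the length dimension
        )
        self.fc = nn.Linear(hidden * (seq_len // 2), out_dim)

    def forward(self, x):                         # x: (batch, channels, 101)
        return self.fc(self.conv(x).flatten(1))

class FusionSketch(nn.Module):
    """Concatenates the CNN and fine-tuned BERT representations for classification."""
    def __init__(self, dnabert_name="zhihan1996/DNA_bert_3"):  # assumed checkpoint
        super().__init__()
        self.cnn = CNNModule()                    # NCP(3)+EIIP(1)+Binary(4)+ENAC(4)
        self.bert = BertModel.from_pretrained(dnabert_name)    # 12 transformer blocks
        self.bert_fc = nn.Linear(self.bert.config.hidden_size, 128)
        self.classifier = nn.Linear(128 + 128, 2)              # fused FC -> 2 classes

    def forward(self, feats, input_ids, attention_mask):
        cnn_repr = self.cnn(feats)
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        return self.classifier(torch.cat([cnn_repr, self.bert_fc(pooled)], dim=1))
```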

Model training procedure

The CNN and BERT modules were trained or fine-tuned using the same training and validation sets. Subsequently, the parameters of these two networks were fixed and the integrated network was trained. For the CNN module, a grid search algorithm was used: the hyper-parameters of the Adam optimizer were set to a learning rate of 1e-5 and a weight decay of 0.001, and a learning rate scheduler was used to determine an appropriate curve for increasing and decreasing the learning rate. For the BERT module, a pre-trained model with a k-mer size of 3 was used, and through grid search the hyper-parameters were optimized to a learning rate of 3e-5 and a warmup proportion of 0.1. These hyper-parameters gradually increase the learning rate at the beginning of training while avoiding the unstable training that results when the learning rate is too high. A hidden dropout probability of 0.1 reduces overfitting and improves generalization performance, while a weight decay of 0.01 controls model complexity to avoid overfitting. Half-precision floating-point arithmetic was also used to increase training speed. All other hyper-parameters were kept at their defaults.
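A minimal sketch of this training configuration, using PyTorch and the transformers scheduler utilities, is shown below; `cnn_module`, `bert_module`, `model` and `train_loader` are placeholders for the components described above, the total step count is arbitrary, and the loop illustrates one training stage rather than the full multi-stage procedure.

```python
import torch
import torch.nn.functional as F
from torch.optim import Adam, AdamW
from transformers import get_linear_schedule_with_warmup

# CNN module: Adam with the reported learning rate and weight decay.
cnn_optim = Adam(cnn_module.parameters(), lr=1e-5, weight_decay=0.001)

# BERT module: warm up over the first 10% of steps, then decay linearly.
num_steps = 10_000                                # placeholder total step count
bert_optim = AdamW(bert_module.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    bert_optim, num_warmup_steps=int(0.1 * num_steps),
    num_training_steps=num_steps)

# Half-precision training via automatic mixed precision.
scaler = torch.cuda.amp.GradScaler()
for feats, input_ids, mask, labels in train_loader:
    bert_optim.zero_grad()
    with torch.cuda.amp.autocast():
        loss = F.cross_entropy(model(feats, input_ids, mask), labels)
    scaler.scale(loss).backward()
    scaler.step(bert_optim)
    scaler.update()
    scheduler.step()
```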

Evaluation metrics

We used widely applied evaluation metrics to assess the performance of the model and compare it with other methods. Specifically, the following six evaluation metrics were considered, where TP represents true positive samples, TN represents true negative samples, FP represents false positive samples, and FN represents false negative samples:

  • Accuracy refers to the proportion of correctly classified samples in the total number of samples. Its calculation formula is:

    (1) $\mathrm{ACC} = \frac{TP + TN}{TP + FP + FN + TN}$

  • Sensitivity, also known as Recall, refers to the proportion of positive cases correctly classified. Its calculation formula is:

    (2) $\mathrm{SN} = \frac{TP}{TP + FN}$

  • Specificity, which refers to the proportion of negative cases correctly classified. Its calculation formula is:

    (3) $\mathrm{SP} = \frac{TN}{TN + FP}$

  • F1 Score, which is the harmonic mean of Precision and Recall. It is suitable for unbalanced classification problems and can be used as an evaluation indicator when the ratio of positive to negative samples is imbalanced. The calculation formula for the F1 score is:

    (4) $\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{SN}}{\mathrm{Precision} + \mathrm{SN}}$
    (5) $\mathrm{Precision} = \frac{TP}{TP + FP}$

  • MCC is the Matthews Correlation Coefficient, a value between −1 and 1, which indicates the correlation between the classifier and the true label. Its calculation formula is:

    (6) $\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

  • AUC, the area under the receiver operating characteristic (ROC) curve, is the metric used to assess the threshold-independent overall performance of the predictors. The ROC curve is plotted with sensitivity as the y-axis and 1-specificity as the x-axis, reflecting the performance of the classifier at different thresholds. The closer the ROC curve is to the upper left corner, the higher the AUC value, and the better the predictor.
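These six metrics can be computed directly from labels and predicted probabilities; a small sketch using scikit-learn follows (the function name `evaluate` is our own illustrative choice).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, recall_score, roc_auc_score)

def evaluate(y_true, y_score, threshold=0.5):
    """Compute ACC, SN, SP, F1, MCC and AUC from labels and probabilities."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "SN": recall_score(y_true, y_pred),      # sensitivity = TP / (TP + FN)
        "SP": tn / (tn + fp),                    # specificity = TN / (TN + FP)
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),   # threshold-independent
    }
```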

Results

Selected traditional feature encodings for the CNN module of MRM-BERT

Our framework fuses a CNN deep neural network module and a fine-tuned pre-trained BERT module to predict 12 common types of RNA modifications (Figure 1). Given a set of modifiable sites, the network learns to extract RNA sequence features from the nucleotide sequence and integrate them with the semantic feature encoding of the sequences, to predict the possible modification map from the RNA sequences.

The CNN module combines various traditional sequence encodings. We first evaluated the performance of different encodings and all possible combinations, then selected the best-performing ones as the input of the CNN module. For this purpose, we applied the ‘K-fold cross-validation & testing’ partitioning strategy to the dataset (see the Methods section for details) and evaluated the performance on the validation set. Notably, because the performance of a CNN model is usually more sensitive to hyperparameter selection and model training strategy, we instead exploited XGBoost, an effective machine learning framework that is less sensitive to the training configuration, to evaluate the performance of different sequence encodings.

Indeed, not all traditional sequence encodings are informative for RNA modification site prediction. Taking AtoI, Psi, m1A, m5C and m6A, five modification types with extensive modification site annotations, as examples, we noted that individual encoding methods alone often could not achieve satisfactory performance. On the other hand, among the six-encoding combinations, the most outstanding one essentially reached the highest accuracy and AUC among all combinations (Figure 2). Therefore, these six sequence encodings were selected as the input encodings of the CNN module.
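This exhaustive screening can be expressed as a loop over all encoding subsets, as in the hedged sketch below; it reuses the illustrative encoders from the Methods section, and `train_seqs`, `val_seqs`, `y_train` and `y_val` are placeholders for one cross-validation fold.

```python
from itertools import combinations
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Illustrative per-sequence feature extractors (see the Methods sketch); each
# maps a sequence window to a flat feature vector.
encoders = {"NCP": lambda s: encode_ncp(s).ravel(),
            "EIIP": encode_eiip,
            "Binary": lambda s: encode_binary(s).ravel(),
            "Kmer": encode_kmer}

def featurize(seqs, combo):
    return np.hstack([np.vstack([encoders[n](s) for s in seqs]) for n in combo])

auc_by_combo = {}
for r in range(1, len(encoders) + 1):
    for combo in combinations(encoders, r):       # every possible subset
        clf = xgb.XGBClassifier(n_estimators=300, eval_metric="logloss")
        clf.fit(featurize(train_seqs, combo), y_train)
        scores = clf.predict_proba(featurize(val_seqs, combo))[:, 1]
        auc_by_combo[combo] = roc_auc_score(y_val, scores)

best_combo = max(auc_by_combo, key=auc_by_combo.get)   # highest validation AUC
```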

Figure 2. Comparison of the prediction performance of different sequence encoding combinations. The performance shown in this figure was evaluated on the validation set of the K-fold cross-validation & testing partitioning strategy. A. Boxplots of AUC and accuracy comparing XGBoost models combining different numbers of encodings for AtoI modification prediction. B. Boxplots of AUC and accuracy comparing XGBoost models combining different numbers of encodings for Psi modification prediction. C. Boxplots of AUC and accuracy comparing XGBoost models combining different numbers of encodings for m1A modification prediction. D. Boxplots of AUC and accuracy comparing XGBoost models combining different numbers of encodings for m6A modification prediction. E. Boxplots of AUC and accuracy comparing XGBoost models combining different numbers of encodings for m5C modification prediction. F. The best-performing feature combinations and their worst AUC rankings (at worst within 15, or at worst within 30) across all modification types.


Integration of fine-tuned BERT module further improves the overall accuracy

To demonstrate the effectiveness of the fusion model that integrates the CNN module with the fine-tuned pre-trained BERT module, we also used the CNN and BERT modules separately to build the predictor. The performance evaluation is summarized in Table 1. In general, the performance of the BERT module is better than that of the CNN module, indicating that the pre-trained sequence representation can more sophisticatedly depict the sequence pattern around modification sites. This advantage is more pronounced for modification types with less known modification site annotation (e.g. m5U and Nm) and those with less prominent sequence motifs (e.g. AtoI). Moreover, in most cases except m6Am, the model integrating the CNN and BERT modules achieves improved performance in comparison with at least one of the two modules, indicating the effectiveness of the integration.

Table 1. The AUC of the MRM-BERT model versus models using individual deep neural network module.

Performance of MRM-BERT on independent test datasets

On the validation set of the K-fold cross-validation & testing partitioning strategy, the fused CNN+BERT model showed competitive performance in various aspects (Supplementary Figure S3 and Table S2). To avoid possible overestimation of performance due to information leakage, we further evaluated the performance of MRM-BERT on independent test samples from this data partitioning strategy.

Under this dataset partitioning, the proportion of the test set is fixed at 10%, and the remaining 90% of samples are evenly divided into five parts for cross-validation. There is no overlap between the test samples and the training or validation samples. Using this procedure, a sizable and independent test set was prepared for the performance evaluation. The evaluation results are summarized in Table 2, and the performance of MRM-BERT on these test samples is comparable to that on the validation samples, indicating the robustness of MRM-BERT’s performance.

Table 2. Performance metrics of MRM-BERT on the test set from the K-fold cross-validation & testing partitioning strategy.

The performance shown in this table was evaluated on the test set of the K-fold cross-validation & testing partitioning strategy.

We further compared MRM-BERT with MultiRM, the state-of-the-art predictor of multiple RNA modifications. Notably, MultiRM adopted a different dataset partitioning strategy, where a fixed number of samples (50 positive samples for each modification type) were assigned to the test set. To ensure a fair comparison with MultiRM, we changed the dataset partitioning to ensure that the same positive and negative test samples were used as in MultiRM. We re-trained MRM-BERT using a dataset that excluded these test samples and compared the performance of MRM-BERT on the original test samples of MultiRM. It is noteworthy that, because of the relatively small size of these test sets, the performance may differ from that observed on more sizable datasets. Nonetheless, as shown in Table 3 and Supplementary Table S4, MRM-BERT shows improved overall performance compared with MultiRM. The most prominent improvements were observed for modification types with less known modification site data or less clear sequence motifs, including Am, m1A and AtoI, with increases in AUC greater than 0.1. Considering that such modification types can be better handled by the fine-tuned BERT module, these improvements can be largely attributed to the integration of the BERT pre-trained deep neural network, which enables a more comprehensive representation of the sequence features of RNA modifications.

Table 3. Comparison of the AUC results on the original test set of MultiRM’s fixed validation and testing partitioning strategy.

Additionally, we evaluated the cross-modification prediction performance for the 12 modification types, as shown in Supplementary Figure S4. Generally, each model performed best on its own modification type, while also showing good performance on similar modification types (for example, the model for Am also performed well on Cm, Gm and Um).

In order to compare MRM-BERT and MultiRM on a completely independent test set, we introduced mouse data of four common RNA modification types (m6A, m5C, Psi, and m1A sites). Due to the cross-species nature of the independent test set from mouse, the performances of MRM-BERT and MultiRM on the mouse test set are not as impressive as those on the human test sets. Nonetheless, MRM-BERT could still outperform MultiRM on the mouse data, indicating its better robustness (Supplementary Table S3).

MRM-BERT online web server

To facilitate the community, we have made MRM-BERT available as an online web server, which can be accessed at http://117.122.208.21:8501. It is free to all users, does not require registration, and is compatible with common web browsers. It integrates the trained model parameters proposed in this paper, provides prediction and display of prediction results for RNA sequences, and offers advanced visualization tools to explore and analyse predicted modification sites in a user-friendly interactive manner. For users who have trouble accessing this web server, we also provide the source code on GitHub (https://github.com/abhhba999/MRM-BERT), together with the runnable models and datasets on FigShare (doi: 10.6084/m9.figshare.24873195, reference number 24873195).

A sample prediction result is shown in Figure 3. Users simply need to paste the sequence into the corresponding window on the server and select the required modification-type prediction option. MRM-BERT will automatically perform sequence embedding, feature calculation and deep neural network inference, predict RNA modification sites, and generate functionally associated interactive graphic representations. This visualization of results is helpful for screening target sites for subsequent experimental assays.

Figure 3. A sample prediction result from the online MRM-BERT webserver.


Discussion

In this work, we have developed a multi-label deep neural network model that integrates traditional sequence feature encodings with the implicit sequence representation from a fine-tuned pre-trained model. This model is capable of simultaneously predicting several common RNA modifications. To fully exploit the multi-level features of RNA sequences, we employed the pre-trained model DNABERT, which was trained on a large amount of DNA data, and fine-tuned it for RNA tasks, thereby obtaining high-dimensional features applicable to RNA tasks at various levels. On this basis, we integrated a deep neural network utilizing multiple RNA sequence features and achieved improved performance, validating the effectiveness of our model design. These results suggest the enormous potential of pre-trained models in RNA modification prediction: pre-trained models can be adapted to different types of RNA modifications, enhancing the model’s generalization capability and adaptability, and can be fine-tuned or used for transfer learning on specific tasks, improving the flexibility and scalability of the model.

The BERT module plays the greater role in most cases, while for a few modifications, such as m5C, m7G and Cm, the CNN module contributes more to the prediction. Our CNN+BERT model achieved the best performance in approximately two-thirds of the cases. Indeed, for particular modification types, the CNN alone can be slightly better than CNN+BERT. However, since the objective of this study is to simultaneously predict multiple modification types under the same computational framework, we chose the CNN+BERT framework, which showed superior performance in most cases. On the other hand, for an ad hoc model of a specific modification type, the sequence encodings, the deep learning framework and the inclusion of pre-trained models could be further changed and optimized, but this is beyond the objective of this study.

To facilitate potential users, we have constructed a web server with a user-friendly interface, and made the dataset, code and model publicly available on the website. Our model is already capable of predicting multiple different types of RNA modifications, although in reality its capability depends on the dataset. Nonetheless, since the BERT representation has been fine-tuned, the deep learning framework of MRM-BERT can be easily transferred to the prediction of other types of RNA modifications once a substantial amount of new modification data becomes available.

On the other hand, MRM-BERT has certain limitations. Although our model can accurately predict possible modification sites of different modification types, this does not ensure reasonable prediction of the modification states of these sites on a transcript. In other words, the in vivo modification state of each predicted site is determined by the epitranscriptome regulatory dynamics; therefore, not all of the predicted sites are modified in the same transcript at the same time. Interestingly, the recently emerging nanopore-based third-generation sequencing and single-cell sequencing technologies would provide useful indications of the modification state dynamics of each modification site. Therefore, further integration with these new generations of sequencing data will be a promising direction for the context-dependent prediction of modification sites and states for dynamic epitranscriptome analysis.

In addition, directly integrating more updated feature encoding methods into the pre-trained model may further enhance the performance of the pre-trained model in extracting features, which is beneficial to subsequent predictions. However, equipment limitations make it difficult for us to spend months using GPUs for customized large-scale pre-training. As the performance of laboratory equipment improves, we hope to have the opportunity to train large pre-trained models that integrate more feature encodings ourselves in the future.

In conclusion, our results demonstrate that the use of pre-trained models to predict RNA modification sites from RNA sequences is an effective and cutting-edge computational method, bringing new opportunities and challenges to RNA modification research. We anticipate that with the advancement of deep learning methods and sequencing methods, we will be able to achieve better computational mapping of the epitranscriptome landscape in the future.

Authors’ contributions

YZ and LW conceptualized the study; LW and YZ designed the methodology; LW performed the analysis; LW and YZ established the online software; LW drafted the manuscript; YZ supervised the study and revised the manuscript. All authors read and approved the final manuscript.


Acknowledgments

We thank Leibo Liu at Peking University for his technical assistance in configuring the server environment.

Disclosure statement

The authors report no relevant financial or non-financial conflicts of interest.

Data availability statement

The data that support the findings of this study are openly available in FigShare at doi: 10.6084/m9.figshare.24873195, reference number 24873195. The source code of this study is openly available on GitHub at https://github.com/abhhba999/MRM-BERT.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/15476286.2024.2315384

Additional information

Funding

This study was supported by the National Key Research and Development Program of China (2021YFF1201201 to YZ) and National Natural Science Foundation of China (32070658 to YZ).

References

  • Zhao LY, Song J, Liu Y, et al. Mapping the epigenetic modifications of DNA and RNA. Protein Cell. 2020;11(11):792–808. doi: 10.1007/s13238-020-00733-7
  • Frye M, Harada BT, Behm M, et al. RNA modifications modulate gene expression during development. Science. 2018;361(6409):1346–9. doi: 10.1126/science.aau1646
  • Vu LP, Pickering BF, Cheng Y, et al. The N(6)-methyladenosine (m(6)A)-forming enzyme METTL3 controls myeloid differentiation of normal hematopoietic and leukemia cells. Nat Med. 2017;23(11):1369–1376. doi: 10.1038/nm.4416
  • Jana S, Hsieh AC, Gupta R. Reciprocal amplification of caspase-3 activity by nuclear export of a putative human RNA-modifying protein, PUS10 during TRAIL-induced apoptosis. Cell Death Dis. 2017;8(10):e3093. doi: 10.1038/cddis.2017.476
  • Roundtree IA, Evans ME, Pan T, et al. Dynamic RNA Modifications in Gene Expression Regulation. Cell. 2017;169(7):1187–200. doi: 10.1016/j.cell.2017.05.045
  • Zhang Y, Hua W, Dang Y, et al. Validated impacts of N6-methyladenosine methylated mRNAs on apoptosis and angiogenesis in myocardial infarction based on MeRIP-seq analysis. Front Mol Biosci. 2021;8:789923. doi: 10.3389/fmolb.2021.789923
  • Hocq R, Paternina J, Alasseur Q, et al. Monitored eCLIP: high accuracy mapping of RNA-protein interactions. Nucleic Acids Res. 2018;46(21):11553–65. doi: 10.1093/nar/gky858
  • Gao Y, Liu X, Wu B, et al. Quantitative profiling of N(6)-methyladenosine at single-base resolution in stem-differentiating xylem of populus trichocarpa using nanopore direct RNA sequencing. Genome Biol. 2021;22(1):22. doi: 10.1186/s13059-020-02241-7
  • Pandey RR, Pillai RS. Counting the cuts: MAZTER-Seq quantifies m(6)A levels using a methylation-sensitive ribonuclease. Cell. 2019;178(3):515–517. doi: 10.1016/j.cell.2019.07.006
  • Meyer KD. DART-seq: an antibody-free method for global m(6)A detection. Nat Methods. 2019;16(12):1275–1280. doi: 10.1038/s41592-019-0570-0
  • Zhou Y, Zeng P, Li YH, et al. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 2016;44(10):e91. doi: 10.1093/nar/gkw104
  • Rehman MU, Tayara H, Chong KT. DL-m6A: identification of N6-methyladenosine sites in mammals using deep learning based on different encoding schemes. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(2):904–911. doi: 10.1109/TCBB.2022.3192572
  • Chen K, Wei Z, Zhang Q, et al. WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res. 2019;47(7):e41. doi: 10.1093/nar/gkz074
  • Zou Q, Xing P, Wei L, et al. Gene2vec: gene subsequence embedding for prediction of mammalian N(6) -methyladenosine sites from mRNA. RNA. 2019;25(2):205–218. doi: 10.1261/rna.069112.118
  • Wang H, Zhao S, Cheng Y, et al. MTDeepM6A-2S: a two-stage multi-task deep learning method for predicting RNA N6-methyladenosine sites of Saccharomyces cerevisiae. Front Microbiol. 2022;13:999506. doi: 10.3389/fmicb.2022.999506
  • Feng P, Chen W. iRNA-m5U: a sequence based predictor for identifying 5-methyluridine modification sites in Saccharomyces cerevisiae. Methods. 2022;203:28–31. doi: 10.1016/j.ymeth.2021.04.013
  • Xiao X, Shao YT, Luo ZT, et al. m5C-HPromoter: an ensemble deep learning predictor for identifying 5-methylcytosine sites in human promoters. Curr Bioinform. 2022;17(5):452–461. doi: 10.2174/1574893617666220330150259
  • Liu Z, Xiao X, Yu DJ, et al. pRNAm-PC: predicting N(6)-methyladenosine sites in RNA sequences via physical-chemical properties. Anal Biochem. 2016;497:60–67. doi: 10.1016/j.ab.2015.12.017
  • Song Z, Huang D, Song B, et al. Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications. Nat Commun. 2021;12(1):4011. doi: 10.1038/s41467-021-24313-3
  • Desai A, Zumbo A, Giordano M, et al. Word2vec word embedding-based artificial intelligence model in the triage of patients with suspected diagnosis of major ischemic stroke: a feasibility study. Int J Environ Res Public Health. 2022;19(22):15295. doi: 10.3390/ijerph192215295
  • Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods. 2015;12(10):931–934. doi: 10.1038/nmeth.3547
  • Devlin J, Chang MW, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019); Minneapolis, Minnesota, USA. 2019;1:4171–4186.
  • Spitale G, Biller-Andorno N, Germani F. AI model GPT-3 (dis)informs us better than humans. Sci Adv. 2023;9(26). doi: 10.1126/sciadv.adh1850
  • Ji Y, Zhou Z, Liu H, et al. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–20. doi: 10.1093/bioinformatics/btab083
  • Tang Y, Chen K, Song B, et al. m6A-Atlas: a comprehensive knowledgebase for unraveling the N6-methyladenosine (m6A) epitranscriptome. Nucleic Acids Res. 2021;49(D1):D134–D43. doi: 10.1093/nar/gkaa692
  • Ma J, Song B, Wei Z, et al. m5C-Atlas: a comprehensive database for decoding and annotating the 5-methylcytosine (m5C) epitranscriptome. Nucleic Acids Res. 2022;50(D1):D196–D203. doi: 10.1093/nar/gkab1075
  • Sun WJ, Li JH, Liu S, et al. RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data. Nucleic Acids Res. 2016;44(D1):D259–65. doi: 10.1093/nar/gkv1036
  • Xuan J, Chen L, Chen Z, et al. RMBase v3.0: decode the landscape, mechanisms and functions of RNA modifications. Nucleic Acids Res. 2023;52(D1):D273–D284. doi: 10.1093/nar/gkad1070
  • Liu B, Liu F, Wang X, et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015;43(W1):W65–71. doi: 10.1093/nar/gkv458
  • Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2020;21(3):1047–57. doi: 10.1093/bib/bbz041
  • Lee D, Karchin R, Beer MA. Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011;21(12):2167–80. doi: 10.1101/gr.121905.111
  • Manavalan B, Basith S, Shin TH, et al. 4mCpred-EL: an ensemble learning framework for identification of DNA N(4)-methylcytosine sites in the mouse genome. Cells. 2019;8(11):1332. doi: 10.3390/cells8111332
  • Chen W, Tran H, Liang Z, et al. Identification and analysis of the N(6)-methyladenosine in the Saccharomyces cerevisiae transcriptome. Sci Rep. 2015;5(1):13859. doi: 10.1038/srep13859
  • Chen Z, Chen YZ, Wang XF, et al. Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs. PloS One. 2011;6(7):e22930. doi: 10.1371/journal.pone.0022930
  • Lalovic D, Veljkovic V. The global average DNA base composition of coding regions may be determined by the electron-ion interaction potential. Biosystems. 1990;23(4):311–6. doi: 10.1016/0303-2647(90)90013-Q
  • Chen TQ, Guestrin C. XGBoost: a scalable tree boosting system. KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; San Francisco, California, USA. 2016:785–794.