Research Paper

Nm-Nano: a machine learning framework for transcriptome-wide single-molecule mapping of 2´-O-methylation (Nm) sites in nanopore direct RNA sequencing datasets

Pages 1-15 | Accepted 01 May 2024, Published online: 17 May 2024
 

ABSTRACT

2´-O-methylation (Nm) is one of the most abundant modifications found in both mRNAs and noncoding RNAs. It contributes to many biological processes, such as the normal functioning of tRNA, the protection of mRNA against degradation by the decapping and exoribonuclease (DXO) protein, and the biogenesis and specificity of rRNA. Recent advances in single-molecule, long-read RNA sequencing offered by Oxford Nanopore Technologies have enabled the direct detection of RNA modifications from sequencing data. In this study, we propose a bio-computational framework, Nm-Nano, for predicting the presence of Nm sites in direct RNA sequencing data generated from two human cell lines. The Nm-Nano framework integrates two supervised machine learning (ML) models for predicting Nm sites: Extreme Gradient Boosting (XGBoost) and Random Forest (RF) with K-mer embedding. Evaluation on benchmark datasets from direct RNA sequencing of HeLa and HEK293 cell lines demonstrates high accuracy (99% with XGBoost and 92% with RF) in identifying Nm sites. Deploying Nm-Nano on HeLa and HEK293 cell lines reveals genes that are frequently modified with Nm. In HeLa cell lines, 125 genes are identified as frequently Nm-modified, showing enrichment in 30 ontologies related to immune response and cellular processes. In HEK293 cell lines, 61 genes are identified as frequently Nm-modified, with enrichment in processes such as glycolysis and protein localization. These findings underscore the diverse regulatory roles of Nm modifications in metabolic pathways, protein degradation, and cellular processes. The source code of Nm-Nano can be freely accessed at https://github.com/Janga-Lab/Nm-Nano.

Acknowledgments

We thank Alexander Krohannon, Hunter M. Gill and Alexandre Plastow at IUI for giving valuable comments on a previous version of this manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Nm-Nano is available at the GitHub repository https://github.com/Janga-Lab/Nm-Nano. The direct RNA sequencing data generated in this study for the HEK293 and HeLa cell lines are publicly available on SRA under the project accessions PRJNA685783 and PRJNA604314, respectively.

Author contributions

DH, AA, and SCJ conceived and designed the study. DH implemented the Nm-Nano GitHub software version. AA and DH implemented the Nm modification ML predictors, namely XGBoost and RF with K-mer embedding, respectively. DH extracted the benchmark datasets. AA tuned the parameters of XGBoost using the grid-search algorithm. AA and DH evaluated the performance of the XGBoost and RF with K-mer embedding models with the random test split and integrated validation testing. DH identified the unique Nm genomic locations and the top modified RNA bases with Nm sites on HeLa and HEK293 cell lines. SVD performed gene length distribution analysis and functional and gene set enrichment analysis. QM performed the cell culturing, RNA library preparation, and nanopore RNA sequencing for HeLa and HEK293 cell lines.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/15476286.2024.2352192

Additional information

Funding

This work is supported by the National Science Foundation (NSF) grants [#1940422 and #1908992] as well as the National Institute of General Medical Sciences of the National Institutes of Health under Award Number [R01GM123314] (SCJ).