Research Paper

BioDeepfuse: a hybrid deep learning approach with integrated feature extraction techniques for enhanced non-coding RNA classification

Pages 1-12 | Accepted 23 Jan 2024, Published online: 25 Mar 2024
 

ABSTRACT

The accurate classification of non-coding RNA (ncRNA) sequences is pivotal for advanced non-coding genome annotation and analysis, a fundamental aspect of genomics that facilitates understanding of ncRNA functions and regulatory mechanisms in various biological processes. While traditional machine learning approaches have been employed to distinguish ncRNAs, they often require extensive feature engineering. More recently, deep learning algorithms have advanced ncRNA classification. This study presents BioDeepFuse, a hybrid deep learning framework that integrates convolutional neural networks (CNNs) or bidirectional long short-term memory (BiLSTM) networks with handcrafted features to improve accuracy. The framework combines k-mer one-hot, k-mer dictionary, and feature extraction techniques for input representation. The extracted features, when embedded into the deep network, enable optimal utilization of the spatial and sequential nuances of ncRNA sequences. We evaluated the performance of BioDeepFuse using benchmark datasets and real-world RNA samples from bacterial organisms. The results showed high accuracy in ncRNA classification, underscoring the robustness of our tool in addressing complex ncRNA sequence data challenges. The effective melding of CNN or BiLSTM with external features heralds promising directions for future research, particularly in refining ncRNA classifiers and deepening insights into the roles of ncRNAs in cellular processes and disease manifestations. Although BioDeepFuse was originally applied to bacterial organisms, the methodologies and techniques integrated into the framework could make it effective in a broader range of domains.
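The abstract names the k-mer one-hot and k-mer dictionary representations without defining them. The following is a minimal sketch of one plausible reading, not the authors' implementation: a sequence is split into overlapping k-mers, each k-mer is mapped either to an integer index (dictionary) or to a one-hot vector of length 4^k, and the result is padded to a fixed length. The function names, the `ACGU` alphabet, the stride of 1, and the use of index 0 for padding and unknown k-mers are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGU"  # assumption: RNA alphabet; DNA-style input would use T instead of U


def build_vocab(k):
    """Map every possible k-mer to an index starting at 1 (0 is reserved for padding)."""
    return {"".join(p): i + 1 for i, p in enumerate(product(ALPHABET, repeat=k))}


def split_kmers(seq, k):
    """Return the overlapping k-mers of seq (stride 1, an assumed choice)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def encode_dictionary(seq, k, max_len):
    """Encode seq as a fixed-length list of k-mer indices, truncated or zero-padded."""
    vocab = build_vocab(k)
    # k-mers with characters outside ALPHABET (e.g. N) fall back to index 0
    ids = [vocab.get(kmer, 0) for kmer in split_kmers(seq, k)][:max_len]
    return ids + [0] * (max_len - len(ids))


def encode_one_hot(seq, k, max_len):
    """Encode seq as a max_len x 4^k matrix with one 1 per known k-mer row."""
    width = len(ALPHABET) ** k
    matrix = []
    for idx in encode_dictionary(seq, k, max_len):
        row = [0] * width
        if idx > 0:  # padding and unknown k-mers stay as all-zero rows
            row[idx - 1] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    print(encode_dictionary("ACGU", k=2, max_len=4))
    for row in encode_one_hot("ACGU", k=1, max_len=5):
        print(row)
```

In a setup like this, the integer indices would typically feed an embedding layer, while the one-hot matrix can be passed directly to a convolutional or recurrent layer; how BioDeepFuse wires these inputs together is described in the paper and repository, not here.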

Acknowledgments

The authors would like to thank USP, CAPES, CNPq, FAPESP, AI4PEP, IDRC, FEMS, ARIS, and HIDA for the financial support for this research. We also thank Denny Popp for his assistance in acquiring the data used in the study.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Availability of data and materials

The documentation, pipeline, images, and results from the models are available in the GitHub repository: https://github.com/brenoslivio/BioDeepFuse. The complete set of 48 new bacterial genomes (and their sequences in FASTA format) is available in the long-term data archive of the Helmholtz Centre for Environmental Research – UFZ data centre at https://www.ufz.de/record/dmp/archive/14024.

Additional information

Funding

This project was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior [CAPES] - Universidade de São Paulo [USP]; grant [#2023/00264-0], São Paulo Research Foundation [FAPESP]; Canada’s International Development Research Centre [IDRC] - Grant No. 109981; and HIDA – Helmholtz Information and Data Science Academy. The work performed by AVS was supported by a FEMS (Federation of European Microbiological Societies) research and training grant and the Helmholtz Information & Data Science Academy visiting research grant. The work performed by PS and IMM was supported by ARIS project J1-4411.