Abstract
Aim: To predict base-resolution DNA methylation in cancerous and paracancerous tissues. Materials & methods: We collected six cancer DNA methylation datasets from The Cancer Genome Atlas and five cancer datasets from Gene Expression Omnibus and established machine learning models using paired cancerous and paracancerous tissues. Tenfold cross-validation and independent validation were performed to demonstrate the effectiveness of the proposed method. Results: The developed cross-tissue prediction models can substantially increase the accuracy at more than 68% of CpG sites and help enhance the statistical power of differential methylation analyses. An XGBoost model that leverages multiple correlated CpGs may further improve the prediction accuracy. Conclusion: This study provides a powerful tool for DNA methylation analysis and has the potential to yield new epigenetic insights for cancer research.
The authors employed machine learning models to predict genome-wide DNA methylation (DNAm) levels in cancerous tissues (CTs) and paracancerous tissues (PTs) when one of them is difficult to obtain.
The proposed model based on a single CpG site improves the mean absolute error at more than 68% of CpG sites.
A multiple-CpG-based XGBoost model can further improve the predictive performance when there is considerable variability between individuals.
Combining measured and predicted PTs to enlarge the sample size makes the CpG sites detected in differential methylation analysis statistically more significant.
The prediction models perform better when CTs, rather than PTs, are used as predictors.
The aggressiveness of cancers and patient outcome may be predictable using well-predicted DNAm profiles in CT/PT.
Functional enrichment analysis based on highly correlated CpG sites identified important pathways involved in cancer progression.
The cross-tumor DNAm prediction model can potentially be applied to an external cancer dataset for the subset of probes that are highly correlated in both cancers.
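As an illustration of the single-CpG idea summarized above, the following minimal sketch fits one model per CpG site on paired samples, predicts PT methylation from CT methylation and compares the tenfold cross-validated mean absolute error (MAE) with a naive baseline. Several parts are assumptions made for illustration and are not taken from the study: the use of ordinary least squares as the per-site model, the baseline (the training-fold mean of the target tissue), the simulated beta values, and all function and site names. It is not the authors' implementation, which is available in the repository cited in the data sharing statement.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (assumed model, for illustration)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:  # constant predictor: fall back to predicting the mean
        return my, 0.0
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def clip01(v):
    """Keep a predicted methylation beta value inside [0, 1]."""
    return min(1.0, max(0.0, v))


def cv_mae(ct, pt, k=10):
    """Return (model MAE, baseline MAE) from k-fold cross-validation for one CpG.

    ct and pt are paired beta values from the same patients. The baseline
    (mean PT value of the training fold) is an assumption of this sketch.
    """
    n = len(ct)
    model_err, base_err = [], []
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        a, b = fit_line([ct[i] for i in train], [pt[i] for i in train])
        base = mean(pt[i] for i in train)
        for i in test:
            model_err.append(abs(clip01(a + b * ct[i]) - pt[i]))
            base_err.append(abs(base - pt[i]))
    return mean(model_err), mean(base_err)


def simulate_site(rng, n, slope, noise):
    """Simulated paired beta values (hypothetical data, not from the study)."""
    ct = [rng.uniform(0.1, 0.9) for _ in range(n)]
    pt = [clip01(0.5 + slope * (c - 0.5) + rng.gauss(0, noise)) for c in ct]
    return ct, pt


rng = random.Random(0)
sites = {
    "cpg_correlated": simulate_site(rng, 60, 0.8, 0.03),
    "cpg_uncorrelated": simulate_site(rng, 60, 0.0, 0.03),
}

# Per-site comparison: a site counts as "improved" when the model MAE is lower.
results = {name: cv_mae(ct, pt) for name, (ct, pt) in sites.items()}
improved = [name for name, (m, b) in results.items() if m < b]

for name, (m, b) in results.items():
    print(f"{name}: model MAE={m:.4f}, baseline MAE={b:.4f}")
print(f"improved at {len(improved)} of {len(results)} sites")
```

In this sketch, the fraction of sites in `improved` plays the role of the "more than 68% of CpG sites" figure reported above, although the value obtained here reflects only the simulated data. A multiple-CpG variant would replace `fit_line` with a model that takes several correlated CpGs as features.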
Author contributions
Conceptualization: B Ma, S Liu, F Song and S Zhang; methodology: B Ma and S Zhang; investigation: B Ma, F Song, S Zhang and Y Liu; visualization: S Zhang; supervision: B Ma, S Liu and F Song; writing – original draft: S Zhang; writing – review and editing: B Ma, S Liu, F Song, S Zhang, Y Liu, Y Shen and D Li. All authors read and approved the final manuscript.
Financial disclosure
This work was supported by the Chinese National Key Research and Development Project (no. 2021YFC2500400) and the National Natural Science Foundation of China (no. 61471078). The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
Competing interests disclosure
The authors have no competing interests or relevant affiliations with any organization or entity with an interest in the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
Writing disclosure
No writing assistance was utilized in the production of this manuscript.
Data sharing statement
The source code and demo data have been deposited at: https://github.com/lab319/DNAm_prediction_CT_PT.