Abstract
The growth of genomic and transcriptomic data has opened new prospects for biomarker identification and cancer prediction. However, existing biomarker identification and cancer diagnosis techniques struggle to capture the complex, nonlinear associations in biological datasets. Machine learning offers considerable potential for building feature selection techniques and models that identify cancer biomarkers. In this article, we propose a Hierarchical Biomarker Selection and Stacked Ensemble model for Biomarker Identification and Cancer Prediction (HBS–STACK) on miRNA, gene expression, and DNA methylation (DM) datasets. Biomarker selection proceeds in three stages: stage 1 aggregates information between CpG sites and genes according to their biological relations, stage 2 applies Fold Change and False Discovery Rate selection, and stage 3 applies Light Gradient Boosting Machine with Recursive Feature Elimination (LGBMRFE) selection. The selected features and markers are integrated and passed to a stacked ML model comprising Gradient Boosting Machine (GBM), Naïve Bayes (NB), and Random Forest (RF) at level 1 learning and a deep neural network (DNN) at level 2 learning. HBS–STACK is evaluated on breast cancer (BRCA) and validated on kidney renal clear cell carcinoma (KIRC) from The Cancer Genome Atlas (TCGA) portal and on Alzheimer's disease. We found several genomic and transcriptomic biomarkers, including IQSEC1 for BRCA; ZFHX3, CTBP2, and SLC9AR2 for KIRC; and TMEM61 for Alzheimer's disease. The experimental results show that HBS–STACK outperformed GBM, NB, and RF, achieving 99.60%, 99.03%, and 92.05% accuracy on BRCA, KIRC, and Alzheimer's disease, respectively, an improvement of 2.27%, 26.03%, and 10.05% over existing techniques.
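To make the selection and stacking stages concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. Several pieces are assumptions made for illustration: scikit-learn's `GradientBoostingClassifier` stands in for LightGBM inside the recursive feature elimination, `MLPClassifier` stands in for the level 2 DNN, the fold-change and FDR thresholds are arbitrary, and the stage 1 aggregation of CpG sites to genes is omitted. The helper names `fold_change_fdr_filter` and `build_stack` are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier


def fold_change_fdr_filter(X, y, min_abs_log2fc=1.0, alpha=0.05):
    """Keep features passing both a |log2 fold change| and an FDR cutoff.

    Assumes X is already on a log2 scale, so the log fold change is the
    difference of class means. Both thresholds are illustrative assumptions.
    """
    case, ctrl = X[y == 1], X[y == 0]
    log2fc = case.mean(axis=0) - ctrl.mean(axis=0)
    pvals = ttest_ind(case, ctrl, axis=0, equal_var=False).pvalue
    # Benjamini-Hochberg adjustment: q_(i) = min over j >= i of p_(j) * m / j
    m = pvals.size
    order = np.argsort(pvals)
    scaled = pvals[order] * m / np.arange(1, m + 1)
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    qvals = np.empty(m)
    qvals[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return np.flatnonzero((np.abs(log2fc) >= min_abs_log2fc) & (qvals <= alpha))


def build_stack(random_state=0):
    """Level 1: GBM, NB, RF. Level 2: a small MLP (stand-in for the DNN)."""
    level1 = [
        ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=random_state)),
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=random_state)),
    ]
    level2 = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000,
                           random_state=random_state)
    return StackingClassifier(estimators=level1, final_estimator=level2, cv=5)


# Synthetic "expression" matrix: only the first 10 features differ by class.
rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 200, 300, 10
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, n_features))
X[:, :n_informative] += 2.0 * y[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fold change + FDR filter, fitted on the training split only.
kept = fold_change_fdr_filter(X_tr, y_tr)

# Recursive feature elimination with a gradient boosting ranker
# (sklearn GBM used here in place of LightGBM).
rfe = RFE(GradientBoostingClassifier(n_estimators=50, random_state=0),
          n_features_to_select=min(5, kept.size), step=1)
rfe.fit(X_tr[:, kept], y_tr)
selected = kept[rfe.support_]

# Stacked ensemble trained on the selected features.
model = build_stack().fit(X_tr[:, selected], y_tr)
accuracy = model.score(X_te[:, selected], y_te)
print("selected feature indices:", selected)
print("held-out accuracy:", round(accuracy, 3))
```

Both selection steps are fitted on the training split only, so the held-out accuracy is not inflated by feature selection leaking information from the test samples. On real multi-omics data the thresholds, the number of retained features, and the network size would need tuning.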
Data availability
The datasets will be made available on reasonable request.
Funding
The authors have no funding to report.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Ethical standards
The authors declare that this article complies with ethical standards.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix I: Algorithm for proposed HBS–STACK
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dhillon, A., Singh, A. & Bhalla, V.K. HBS–STACK: hierarchical biomarker selection and stacked ensemble model for biomarker identification and cancer prediction on multi-omics. Neural Comput & Applic 36, 5413–5431 (2024). https://doi.org/10.1007/s00521-023-09359-2
DOI: https://doi.org/10.1007/s00521-023-09359-2