Framework for classification of cancer gene expression data using Bayesian hyper-parameter optimization

Koul, Nimrita; Manvi, Sunilkumar S.

doi:10.1007/s11517-021-02442-7

Original Article
Published: 05 October 2021

Volume 59, pages 2353–2371, (2021)

Medical & Biological Engineering & Computing

Abstract

Computational classification of cancers is an important research problem. Gene expression data typically have thousands of features, very few samples, and imbalanced classes. In this paper, we propose a framework for classifying cancer gene expression profiles. The framework consists of a pipeline of methods for data pre-processing, feature selection, and classification. Data pre-processing is done by standard scaling and normalization of the features. Feature selection is performed in two steps: first, recursive feature elimination (RFE) is applied; then, a genetic algorithm is applied only if RFE yields a feature subset larger than a specified threshold. The next stage is a meta-pool of diverse individual and ensemble classifiers. The hyper-parameters of each member of the meta-pool are optimized using Bayesian optimization. An algorithm then selects the best classifier from the meta-pool based on classification accuracy and computation time. We evaluated the framework on six publicly available microarray datasets and the PAN-Cancer RNA Sequencing dataset. The classifier selected by the proposed framework significantly improved both classification accuracy and the computation time required to predict labels for the test datasets. A detailed comparison with state-of-the-art methods shows that the proposed framework outperforms all of them.
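
The two decision rules named in the abstract (running the genetic algorithm only when the RFE subset exceeds a threshold, and choosing one classifier from the meta-pool by accuracy and computation time) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' algorithm: the names `Candidate`, `select_best`, and `needs_second_stage`, the tolerance-based tie-break, and all numbers are assumptions made for this sketch, since the abstract does not specify how accuracy and time are traded off.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """One tuned member of the meta-pool (all fields are illustrative)."""
    name: str
    accuracy: float         # held-out classification accuracy in [0, 1]
    predict_seconds: float  # wall-clock time to label the test set


def select_best(candidates: Iterable[Candidate], tolerance: float = 0.01) -> Candidate:
    """Pick the fastest candidate among those within `tolerance` of the best accuracy.

    The tolerance rule is an assumption of this sketch; the abstract only
    states that accuracy and computation time are both considered.
    """
    pool = list(candidates)
    if not pool:
        raise ValueError("meta-pool is empty")
    best_accuracy = max(c.accuracy for c in pool)
    near_best = [c for c in pool if best_accuracy - c.accuracy <= tolerance]
    # Among near-best candidates prefer lower time, then higher accuracy.
    return min(near_best, key=lambda c: (c.predict_seconds, -c.accuracy))


def needs_second_stage(n_selected: int, threshold: int) -> bool:
    """Return True when the RFE subset is still larger than the threshold,
    i.e. when the genetic algorithm stage would be run."""
    return n_selected > threshold


# Hypothetical numbers for illustration only; they are not results from the paper.
meta_pool = [
    Candidate("svm", accuracy=0.970, predict_seconds=0.40),
    Candidate("random_forest", accuracy=0.965, predict_seconds=0.12),
    Candidate("knn", accuracy=0.900, predict_seconds=0.05),
]
chosen = select_best(meta_pool)
print(chosen.name)
```

With a tolerance of zero the rule reduces to picking the most accurate candidate; a larger tolerance lets a faster classifier win when its accuracy is nearly as good. How the framework itself weighs the two criteria is described in the full article, not in the abstract.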

Acknowledgements

The authors thank the reviewers for their useful comments and suggestions, which greatly improved this work and its presentation.

Funding

This research work is funded by the Department of Science and Technology (DST), Government of India, under the scheme DST ICPS 2018.

Author information

Authors and Affiliations

School of Computer Science and Engineering, REVA University, Bangalore, Karnataka, 560064, India
Nimrita Koul & Sunilkumar S. Manvi

Authors

Nimrita Koul
Sunilkumar S. Manvi

Corresponding author

Correspondence to Nimrita Koul.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Koul, N., Manvi, S.S. Framework for classification of cancer gene expression data using Bayesian hyper-parameter optimization. Med Biol Eng Comput 59, 2353–2371 (2021). https://doi.org/10.1007/s11517-021-02442-7

Received: 25 February 2021
Accepted: 13 September 2021
Published: 05 October 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s11517-021-02442-7
