Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture

Kishore, Akash; Venkataramana, Lokeswari; Prasad, D. Venkata Vara; Mohan, Akshaya; Jha, Bhavya

doi:10.1007/s11517-023-02892-1

Original Article
Published: 02 August 2023

Volume 61, pages 2895–2919 (2023)

Medical & Biological Engineering & Computing

Abstract

Predicting the stage of a cancer plays an important role in planning the course of treatment, but it has largely relied on imaging tools, which do not capture the molecular events that drive cancer progression. Analyses based on gene-expression data can identify these events, which allows RNA-sequencing and microarray data to be used for cancer analysis. Breast cancer is the most common cancer worldwide and is classified into four stages: stages 1, 2, 3, and 4 [2]. Machine learning models have previously been explored for stage classification with limited success, and multi-class stage classification in particular has seen little progress. Improved multi-class classification models are therefore needed, and deep learning models are one avenue to investigate. Gene-expression-based cancer data is characterised by small datasets, class imbalance, and high dimensionality, so class balancing methods must be applied to the dataset. Since not all genes are necessary for stage prediction, retaining only the necessary genes can improve classification accuracy. In this work, breast cancer samples are classified into four classes corresponding to stages 1 to 4. Invasive ductal carcinoma (IDC) breast cancer samples are obtained from The Cancer Genome Atlas (TCGA) and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets and combined. Two class balancing techniques are explored: synthetic minority oversampling technique (SMOTE), and SMOTE followed by random undersampling. A hybrid feature selection approach is proposed, and three pipelines combining filter and embedded feature selection methods are explored: Pipeline 1 uses minimum-redundancy maximum-relevancy (mRMR) and correlation feature selection (CFS); Pipeline 2 uses mRMR, mutual information (MI), and CFS; and Pipeline 3 uses mRMR and support vector machine–recursive feature elimination (SVM-RFE).
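The class balancing step described above (SMOTE followed by random undersampling) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `smote` and `balance`, the common per-class `target` size, the choice of `k`, and the toy stage counts are all assumptions made for illustration. The sketch interpolates between a minority sample and one of its nearest same-class neighbours to create synthetic samples, then randomly drops samples from over-represented classes.

```python
import random
from collections import Counter


def smote(samples, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a sample and
    one of its k nearest neighbours from the same class.
    Assumes the class has at least two samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # k nearest same-class neighbours by squared Euclidean distance
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def balance(X, y, target, k=3, seed=0):
    """Oversample classes smaller than `target` with SMOTE, then randomly
    undersample classes larger than `target` (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) < target:
            rows = rows + smote(rows, target - len(rows), k, rng)
        elif len(rows) > target:
            rows = rng.sample(rows, target)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out


# Toy imbalanced data: hypothetical stage -> sample count, 5 features each
data_rng = random.Random(42)
sizes = {1: 40, 2: 25, 3: 10, 4: 5}
X, y = [], []
for stage, n in sizes.items():
    for _ in range(n):
        X.append([data_rng.gauss(stage, 1.0) for _ in range(5)])
        y.append(stage)

X_bal, y_bal = balance(X, y, target=20)
print(Counter(y_bal))
```

Because each synthetic sample is a convex combination of two real samples, it stays inside the range spanned by the original minority class, which is why this kind of oversampling is usually preferred over simply duplicating rows.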
Classification is performed with deep learning models, namely a deep neural network (DNN), a convolutional neural network, a recurrent neural network, a modified DNN, and an AutoKeras-generated model. Classification performance after class balancing and the various feature selection techniques shows marked improvement over classification without feature selection. The best multi-class classification was obtained by a DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving a Cohen-Kappa score (CKS) of 0.303 and a classification accuracy of 53.1%. For binary classification into early- and late-stage cancer, the best performance was obtained by a modified DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving an accuracy of 81.0% and a CKS of 0.280. This pipeline also improved multi-class classification performance on neuroblastoma cancer data, with a best area under the receiver operating characteristic curve (auROC) score of 0.872, compared with 0.71 obtained in previous work, an improvement of 22.81%. The results and analysis reveal that feature selection techniques play a vital role in gene-expression-based classification and that the proposed hybrid feature selection pipeline improves classification performance. Multi-class classification is possible using deep learning models, though further improvement, particularly in late-stage classification, is necessary and should be explored.
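The abstract reports both accuracy and Cohen's kappa, and the two can diverge sharply on imbalanced data. As a hedged illustration, the sketch below computes kappa as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected by chance from the row and column totals. The confusion matrix `toy` is hypothetical and is not taken from the paper; the helper names are likewise assumptions. The last helper shows the relative-improvement arithmetic behind the auROC comparison, which comes out at roughly 22.8% for 0.872 against 0.71.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix
    (rows: true class, columns: predicted class)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_o = accuracy(confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of marginal frequencies, summed over classes
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def relative_improvement(new, old):
    """Relative change of `new` over `old`, in percent."""
    return (new - old) / old * 100


# Hypothetical early/late-stage confusion matrix (85 early, 15 late samples)
toy = [[80, 5],
       [10, 5]]

print(f"accuracy = {accuracy(toy):.3f}")
print(f"kappa    = {cohen_kappa(toy):.3f}")
print(f"auROC relative improvement = {relative_improvement(0.872, 0.71):.1f}%")
```

In the toy matrix most of the accuracy comes from the majority class, so kappa is far lower than accuracy. The same effect helps explain how a binary accuracy of 81.0% can sit alongside a CKS of 0.280.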

Data availability

The datasets used in this work were obtained from METABRIC [6] and TCGA [31].

References

Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. Retrieved July 31, 2023, from https://www.tensorflow.org/
Ahmed O, Brifcani A (2019, April) Gene expression classification based on deep learning. In 2019 4th Scientific International Conference Najaf (SICN). IEEE, pp 145–149
American Cancer Society (2021, June 28) Stages of breast cancer: Understand breast cancer staging. Retrieved October 25, 2021, from https://www.cancer.org/cancer/breast-cancer/understanding-a-breast-cancer-diagnosis/stages-of-breast-cancer.html
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A (2013) NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res 41:D991–D995
Castillo D, Gálvez JM, Herrera LJ, Román BS, Rojas F, Rojas I (2017) Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling. BMC Bioinformatics 18(1):506
cBioPortal for Cancer Genomics (2016) Breast cancer (METABRIC, Nature 2012 & Nat Commun 2016). Retrieved May 25, 2022, from http://www.cbioportal.org/study/summary?id=brca_metabric
Daoud M, Mayo M (2019) A survey of neural network-based cancer prediction models from microarray data. Artif Intell Med 97:204–214
Dertat A (2017, October 9) Applied deep learning — part 1: Artificial neural networks. Medium. Retrieved October 25, 2021, from https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
Ding C, Peng H (2005) Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 3(02):185–205
Fathi H, AlSalman H, Gumaei A, Manhrawy II, Hussien AG, El-Kafrawy P (2021) An efficient cancer classification model using microarray and high-dimensional data. Comput Intell Neurosci 2021
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. Retrieved October 25, 2021, from https://www.deeplearningbook.org
Google Developers (2020, Feb 11) Classification: Precision and recall | Machine learning crash course. https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
Google Developers (n.d.) Classification: ROC curve and AUC. Retrieved May 25, 2022, from https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
Google (n.d.) Google Colab. Google Colaboratory. Retrieved May 25, 2022, from https://research.google.com/colaboratory/faq.html
Gosain A, Sardana S (2017) Handling class imbalance problem using oversampling techniques: A review. 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI)
Griffith M, Walker J, Spies N, Ainscough B, Griffith O (2015) Informatics for RNA sequencing: a Web resource for analysis on the cloud. PLoS Comput Biol 11(8):e1004393
Hambali MA, Oladele TO, Adewole KS (2020) Microarray cancer feature selection: Review, challenges and research directions. Int J Cogn Comput Eng 1:78–97
IBM Cloud Education (2020) What is deep learning? IBM. Retrieved October 25, 2021, from https://www.ibm.com/cloud/learn/deep-learning
Jin H, Chollet F, Song Q, Hu X (2023) AutoKeras: an AutoML library for deep learning. J Mach Learn Res 6:1–6
Liang H, Zhou G, Lv L et al (2021) KRAS expression is a prognostic indicator and associated with immune infiltration in breast cancer. Breast Cancer 28:379–386. https://doi.org/10.1007/s12282-020-01170-4
Lin Z, Ou-Yang L (2023) Inferring gene regulatory networks from single-cell gene expression data via deep multi-view contrastive learning. Brief Bioinforma 24(1):bbac586. https://doi.org/10.1093/bib/bbac586
Mignone P, Pio G, D’Elia D, Ceci M (2020) Exploiting transfer learning for the reconstruction of the human gene regulatory network. Bioinformatics 36(5):1553–1561. https://doi.org/10.1093/bioinformatics/btz781
Park A, Nam S (2019) Deep learning for stage prediction in neuroblastoma using gene expression data. Genom Inform 17(3)
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Duchesnay É (2011) Scikit-learn: Machine learning in python. J Mach Learn Res 12:2825–2830
Pereira B, Chin SF, Rueda O et al (2016) The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes. Nat Commun 7:11479. https://doi.org/10.1038/ncomms11479
Rajbhandari P, Lopez G, Capdevila C, Salvatori B et al (2018) Cross-cohort analysis identifies a TEAD4-MYCN positive feedback loop as the core regulatory element of high-risk neuroblastoma. Cancer Discov 8(5):582–599
Roy S, Kumar R, Mittal V, Gupta D (2020) Classification models for invasive ductal carcinoma progression, based on gene expression data-trained supervised machine learning. Sci Rep 10(1):1–15
Scitable by Nature Education (2014) Gene Expression Is Analyzed by Tracking RNA. Retrieved May 25, 2022, from https://www.nature.com/scitable/topicpage/gene-expression-is-analyzed-by-tracking-rna-6525038/
Sun L, Kong X, Xu J, Zhai R, Zhang S (2019) A hybrid gene selection method based on ReliefF and ant colony optimization algorithm for tumor classification. Sci Rep 9(1):1–14
Suzuki E, Sugimoto M, Kawaguchi K et al (2019) Gene expression profile of peripheral blood mononuclear cells may contribute to the identification and immunological classification of breast cancer patients. Breast Cancer 26:282–289. https://doi.org/10.1007/s12282-018-0920-2
The Cancer Genome Atlas Program (n.d.) National Cancer Institute. Retrieved May 25, 2022, from https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
UICC (2022) UICC and the TNM classification of malignant tumours. UICC. Retrieved May 25, 2022, from https://www.uicc.org/who-we-are/about-uicc/uicc-and-tnm-classification-malignant-tumours
Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH (2018) Relief-based feature selection: introduction and review. J Biomed Inform 85:189–203
Viera AJ, Garrett JM (2005) Understanding interobserver agreement: the kappa statistic. Fam Med 37(5):360–363
World Health Organization (2021) Breast cancer. World Health Organization. Retrieved October 25, 2021, from https://www.who.int/news-room/fact-sheets/detail/breast-cancer
Yao F, Zhang C, Du W, Liu C, Xu Y (2015) Identification of gene-expression signatures and protein markers for breast cancer grading and staging. PLoS One 10(9):e0138213
Yuan F, Lu L, Zou Q (2020) Analysis of gene expression profiles of lung cancer subtypes with machine learning algorithms. Biochimica et Biophysica Acta (BBA)-Mol Basis Dis 1866(8):165822
Yang ZJ, Yu Y, Chi JR et al (2018) The combined pN stage and breast cancer subtypes in breast cancer: a better discriminator of outcome can be used to refine the 8th AJCC staging manual. Breast Cancer 25:315–324. https://doi.org/10.1007/s12282-018-0833-0
Zhong L, Meng Q, Chen Y, Du L, Wu P (2021) A laminar augmented cascading flexible neural forest model for classification of cancer subtypes based on gene expression data. BMC Bioinformatics 22(1):1–17. https://doi.org/10.1186/s12859-021-04391-2

Author information

Authors and Affiliations

Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
Akash Kishore, Lokeswari Venkataramana, D. Venkata Vara Prasad, Akshaya Mohan & Bhavya Jha

Contributions

Akash Kishore: literature survey, dataset collection, data preparation, and implementation. Lokeswari Y Venkataramana: evaluation of the implementation, manuscript updates, and review of the work. D Venkata Vara Prasad: review of the paper and suggestions on using different variations of the dataset. Akshaya Mohan: literature survey, implementation, and manuscript writing. Bhavya Jha: literature survey, implementation, and preparation of figures and tables.

Corresponding author

Correspondence to Lokeswari Venkataramana.

Ethics declarations

Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors. All the authors have agreed to publish this manuscript. The manuscript has not been submitted to, and is not under consideration by, any other journal.

Informed consent

Informed consent is not necessary as this article does not involve human or animal participants.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kishore, A., Venkataramana, L., Prasad, D.V. et al. Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture. Med Biol Eng Comput 61, 2895–2919 (2023). https://doi.org/10.1007/s11517-023-02892-1

Received: 06 December 2022
Accepted: 19 July 2023
Published: 02 August 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s11517-023-02892-1
